Abstract


We address the problem of partially-labeled multiclass classification, where instead of a single la- 
bel per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our 
setting is motivated by a common scenario in many image and video collections, where only partial 
access to labels is available. The goal is to learn a classifier that can disambiguate the partially- 
labeled training instances, and generalize to unseen data. We define an intuitive property of the 
data distribution that sharply characterizes the ability to learn in this setting and show that effec- 
tive learning is possible even when all the data is only partially labeled. Exploiting this property 
of the data, we propose a convex learning formulation based on minimization of a loss function 
appropriate for the partial label setting. We analyze the conditions under which our loss function 
is asymptotically consistent, as well as its generalization and transductive performance. We apply 
our framework to identifying faces culled from web news sources and to naming characters in TV 
series and movies; in particular, we annotated and experimented on a very large video data set and 
achieve 6% error for character naming on 16 episodes of the TV series Lost. 

Keywords: weakly supervised learning, multiclass classification, convex learning, generalization 
bounds, names and faces 


1. Introduction 


We consider a weakly-supervised multiclass classification setting where each instance is partially 
labeled: instead of a single label per instance, the algorithm is given a candidate set of labels, only 
one of which is correct. A typical example arises in photographs containing several faces per image 
and a caption that only specifies who is in the picture but not which name matches which face. In 
this setting each face is ambiguously labeled with the set of names extracted from the caption, see 
Figure 1 (bottom). Photograph collections with captions have motivated much recent interest in 
weakly annotated images and videos (Duygulu et al., 2002; Barnard et al., 2003; Berg et al., 2004; 
Gallagher and Chen, 2007). Another motivating example is shown in Figure 1 (top), which shows 
a setting where we can obtain plentiful but weakly labeled data: videos and screenplays. Using a 
screenplay, we can tell who is in a given scene, but for every detected face in the scene, the person’s
identity is ambiguous: each face is partially labeled with the set of characters appearing at some
point in the scene (Satoh et al., 1999; Everingham et al., 2006; Ramanan et al., 2007). The goal in
each case is to learn a person classifier that can not only disambiguate the labels of the training faces,
but also generalize to unseen data. Learning accurate models for face and object recognition from
such imprecisely annotated images and videos can improve the performance of many applications,
including image retrieval and video summarization.

[Figure 1 panels. Top: Scene 1, Scene 2 and Scene 3 with screenplay excerpts. Bottom: news photographs with captions naming several people.]

Figure 1: Two examples of partial labeling scenarios for naming faces. Top: using a screenplay,
we can tell who is in a movie scene, but for every face in the corresponding images, the
person’s identity is ambiguous (green labels). Bottom: images in photograph collections
and webpages are often tagged ambiguously with several potential names in the caption
or nearby text. In both cases, our goal is to learn a model from ambiguously labeled examples
so as to disambiguate the training labels and also generalize to unseen examples.

This partially labeled setting is situated between fully supervised and fully unsupervised learn- 
ing, but is qualitatively different from the semi-supervised setting where both labeled and unlabeled 
data are available. Several papers have addressed this partially labeled (also called
ambiguously labeled) problem. Many formulations use expectation-maximization-like algorithms
to estimate the model parameters and “fill in” the labels (Côme et al., 2008; Ambroise et al.,
2001; Vannoorenberghe and Smets, 2005; Jin and Ghahramani, 2002). Most methods involve either
non-convex objectives or procedural, iterative reassignment schemes, which come without any
guarantee of reaching a global optimum of the objective or of classification accuracy. To the best of our
knowledge, there has been no theoretical analysis of the conditions under which the proposed approaches
are guaranteed to learn accurate classifiers. The contributions of this paper are:


• We show theoretically that effective learning is possible under reasonable distributional
assumptions even when all the data is partially labeled, leading to useful upper and lower bounds
on the true error.

• We propose a convex learning formulation based on this analysis by extending general multiclass
loss functions to handle partial labels.

• We apply our convex learning formulation to the task of identifying faces culled from web
news sources, and to naming characters in TV series. We experiment on a large data set
consisting of 100 hours of video, and in particular achieve 6% (resp. 13%) error for character
naming across 8 (resp. 32) labels on 16 episodes of Lost, consistently outperforming several
strong baselines.

• We contribute the Annotated Faces on TV data set, which contains about 3,000 cropped faces
extracted from 8 episodes of the TV show Lost (one face per track). Each face is registered
and annotated with a groundtruth label (there are 40 different characters). We also include a
subset of those faces with the partial label set automatically extracted from the screenplay.

• We provide the Convex Learning from Partial Labels Toolbox, an open-source MATLAB and
C++ implementation of our approach as well as the baseline approach discussed in the paper.
The code includes scripts to illustrate the process on Faces in the Wild Data Set (Huang et al.,
2007a) and our Annotated Faces on TV data set.


The paper is organized as follows.¹ We review related work and relevant learning scenarios
in Section 2. We pose the partially labeled learning problem as minimization of an ambiguous
loss in Section 3, and establish upper and lower bounds between the (unobserved) true loss and the 
(observed) ambiguous loss in terms of a critical distributional property we call ambiguity degree. We 
propose the novel Convex Learning from Partial Labels (CLPL) formulation in Section 4, and show 
it offers a tighter approximation to the ambiguous loss, compared to a straightforward formulation. 
We derive generalization bounds for the inductive setting, and in Section 5 also provide bounds for 
the transductive setting. In addition, we provide reasonable sufficient conditions that will guarantee 
a consistent labeling in a simple case. We show how to solve proposed CLPL optimization problems 
by reducing them to more standard supervised optimization problems in Section 6, and provide 
several concrete algorithms that can be adapted to our setting, such as support vector machines and 
boosting. We then proceed to a series of controlled experiments in Section 7, comparing CLPL to 
several baselines on different data sets. We also apply our framework to a naming task in TV series, 
where screenplay and closed captions provide ambiguous labels. The code and data used in the 
paper can be found at: http://www.vision.grasp.upenn.edu/video


2. Related Work 


We review here the related work on learning under several forms of weak supervision, as well
as concrete applications.


2.1 Weakly Supervised Learning 


To put the partially-labeled learning problem into perspective, it is useful to lay out several related 
learning scenarios (see Figure 2), ranging from fully supervised (supervised and multi-label learn- 
ing), to weakly-supervised (semi-supervised, multi-instance, partially-labeled), to unsupervised. 


• In semi-supervised learning (Zhu and Goldberg, 2009; Chapelle et al., 2006), the learner has
access to a set of labeled examples as well as a set of unlabeled examples.

• In multi-label learning (Boutell et al., 2004; Tsoumakas et al., 2010), each example is assigned
multiple labels, all of which can be true.

• In multi-instance learning (Dietterich et al., 1997; Andrews and Hofmann, 2004; Viola et al.,
2006), examples are not individually labeled but grouped into sets which either contain at least
one positive example, or only negative examples. A special case considers the easier scenario
where label proportions in each bag are known (Kuck and de Freitas, 2005), allowing one to
compute convergence bounds on the estimation error of the correct labels (Quadrianto et al.,
2009).

• Finally, in our setting of partially labeled learning, also called ambiguously labeled learning,
each example again is supplied with multiple labels, only one of which is correct. A formal
definition is given in Section 3.

Figure 2: Range of supervision in classification. Training may be: supervised (a label is given for
each instance), unsupervised (no label is given for any instance), semi-supervised (labels
are given for some instances), multi-label (each instance can have multiple labels),
multi-instance (a label is given for a group of instances where at least one instance in the
group has the label), or partially-labeled (for each instance, several possible labels are
given, only one of which is correct).

1. A preliminary version of this work appeared in Cour et al. (2009). Sections 4.2 to 6 present new material, and
Sections 7 and 8 contain additional experiments, data sets and comparisons.


Clearly, these settings can be combined, for example with multi-instance multi-label learning 
(MIML) (Zhou and Zhang, 2007), where training instances are associated with not only multiple 
instances but also multiple labels. Another combination of interest appears in a recent paper build- 
ing on our previous work (Cour et al., 2009) that addresses the case where sets of instances are 
ambiguously labeled with candidate labeling sets (Luo and Orabona, 2010). 


2.2 Learning From Partially-labeled or Ambiguous Data 


There have been several papers that addressed the ambiguous label problem. A number of these use 
the expectation-maximization algorithm (EM) to estimate the model parameters and the true label 
(Côme et al., 2008; Ambroise et al., 2001; Vannoorenberghe and Smets, 2005; Jin and Ghahramani, 
2002). For example, Jin and Ghahramani (2002) use an EM-like algorithm with a discriminative
log-linear model to disambiguate correct labels from incorrect ones. Grandvalet and Bengio (2004) add
a minimum entropy term to the set of possible label distributions, with a non-convex objective as
in Jin and Ghahramani (2002). Hüllermeier and Beringer (2006) propose several non-parametric,
instance-based algorithms for ambiguous learning based on greedy heuristics. These
papers only report results on synthetically created ambiguous labels for data sets such as those in the UCI
repository. Also, the proposed algorithms rely on iterative non-convex learning.
2.3 Images and Captions


A related multi-class setting is common for images with captions: for example, a photograph of a 
beach with a palm tree and a boat, where object locations are not specified. Duygulu et al. (2002) 
and Barnard et al. (2003) show that such partial supervision can be sufficient to learn to identify the 
object locations. The key observation is that while text and images are separately ambiguous, jointly 
they complement each other. The text, for instance, does not mention obvious appearance properties, 
but the frequent co-occurrence of a word with a visual element could be an indication of association 
between the word and a region in the image. Of course, words in the text without correspondences 
in the image and parts of the image not described in the text are virtually inevitable. The problem 
of naming image regions can be posed as translation from one language to another. Barnard et al. 
(2003) address it using a multi-modal extension to mixture of latent Dirichlet allocations. 


2.4 Names and Faces 


The specific problem of naming faces in images and videos using text sources has been addressed 
in several works (Satoh et al., 1999; Berg et al., 2004; Gallagher and Chen, 2007; Everingham et al., 
2006). There is a vast literature on fully supervised face recognition, which is out of the scope of this 
paper. Approaches relevant to ours include Berg et al. (2004), which aims at clustering face images 
obtained by detecting faces from images with captions. Since the names of the depicted people
typically appear in the caption, the resulting set of images is ambiguously labeled if more than
one name appears in the caption. Moreover, in some cases the correct name may not be included 
in the set of potential labels for a face. The problem can be solved by using unambiguous images 
to estimate discriminant coordinates for the entire data set. The images are clustered in this space 
and the process is iterated. Gallagher and Chen (2007) address the similar problem of retrieval from 
consumer photo collections, in which several people appear in each image which is labeled with 
their names. Instead of estimating a prior probability for each individual, the algorithm estimates a 
prior for groups using the ambiguous labels. Unlike Berg et al. (2004), the method of Gallagher and 
Chen (2007) does not handle erroneous names in the captions. 


2.5 People in Video 


In work on video, a wide range of cues has been used to automatically obtain supervised data, 
including: captions or transcripts (Everingham et al., 2006; Cour et al., 2008; Laptev et al., 2008), 
sound (Satoh et al., 1999) to obtain the transcript, or clustering based on clothing, face and hair 
color within scenes to group instances (Ramanan et al., 2007). Most of the methods involve either 
procedural, iterative reassignment schemes or non-convex optimization. 


3. Formulation 


In the standard supervised multiclass setting, we have labeled examples $S = \{(x_i, y_i)\}_{i=1}^m$ from an
unknown distribution $P(X, Y)$ where $X \in \mathcal{X}$ is the input and $Y \in \{1, \dots, L\}$ is the class label. In the
partially supervised setting we investigate, instead of an unambiguous single label per instance we
have a set of labels, one of which is the correct label for the instance. We will denote $\mathbf{y}_i = \{y_i\} \cup \mathbf{z}_i$
as the ambiguity set actually observed by the learning algorithm, where $\mathbf{z}_i \subseteq \{1, \dots, L\} \setminus \{y_i\}$
is a set of additional labels, and $y_i$ the latent groundtruth label which we would like to recover.
Throughout the paper, we will use boldface to denote sets and uppercase to denote random variables
with corresponding lowercase values of random variables. We suppose $X, Y, \mathbf{Z}$ are distributed
according to an (unknown) distribution $P(X, Y, \mathbf{Z}) = P(X) P(Y \mid X) P(\mathbf{Z} \mid X, Y)$ (see Figure 3, right),
of which we only observe samples of the form $S = \{(x_i, \mathbf{y}_i)\}_{i=1}^m = \{(x_i, \{y_i\} \cup \mathbf{z}_i)\}_{i=1}^m$. (In case $X$ is
continuous, $P(X)$ is a density with respect to some underlying measure $\mu$ on $\mathcal{X}$, but we will simply
refer to the joint $P(X, Y, \mathbf{Z})$ as a distribution.) With the above definitions, $y_i \in \mathbf{y}_i$, $\mathbf{z}_i \subset \mathbf{y}_i$, $y_i \notin \mathbf{z}_i$ and
$Y \in \mathbf{Y}$, $\mathbf{Z} \subset \mathbf{Y}$, $Y \notin \mathbf{Z}$.

Figure 3: Left: Co-occurrence graph of the top characters across 16 episodes of Lost. Edge thickness
corresponds to the co-occurrence frequency of characters. Right: The model of the
data generation process: $(X, \mathbf{Y})$ are observed, $(Y, \mathbf{Z})$ are hidden, with $\mathbf{Y} = \{Y\} \cup \mathbf{Z}$.

Clearly, our setup generalizes the standard semi-supervised setting where some examples are
labeled and some are unlabeled: an example is labeled when the corresponding ambiguity set $\mathbf{y}_i$ is a
singleton, and unlabeled when $\mathbf{y}_i$ includes all the labels. However, we do not explicitly consider the
semi-supervised setting in this paper, and our analysis below provides essentially vacuous bounds for
the semi-supervised case. Instead, we consider the middle ground, where all examples are partially
labeled as described in our motivating examples, and analyze assumptions under which learning can
be guaranteed to succeed.


In order to learn from ambiguous data, we must make some assumptions about the distribution
$P(\mathbf{Z} \mid X, Y)$. Consider a very simple ambiguity pattern that makes accurate learning impossible:
$L = 3$, $|\mathbf{z}_i| = 1$ and label 1 is present in every set $\mathbf{y}_i$, for all $i$. Then we cannot distinguish between the
case where 1 is the true label of every example, and the case where it is not a label of any example.
More generally, if two labels always co-occur when present in $\mathbf{y}$, we cannot tell them apart. In order
to disallow this case, below we will make an assumption on the distribution $P(\mathbf{Z} \mid X, Y)$ that ensures
some diversity in the ambiguity set. This assumption is often satisfied in practice. For example,
consider our initial motivation of naming characters in TV shows, where the ambiguity set for any
given detected face in a scene is given by the set of characters occurring at some point in that scene.
In Figure 3 (left), we show the co-occurrence graph of characters in a season of the TV show Lost,
where the thickness of the edges corresponds to the number of times characters share a scene. This
suggests that for most characters, ambiguity sets are diverse and we can expect that the ambiguity
degree (defined in Section 3.1) is small. A more quantitative diagram will be given in Figure 11 (left).

Symbol                                                     Meaning
$x, X$                                                     observed input value/variable: $x, X \in \mathcal{X}$
$y, Y$                                                     hidden label value/variable: $y, Y \in \{1, \dots, L\}$
$\mathbf{z}, \mathbf{Z}$                                   hidden additional label set/variable: $\mathbf{z}, \mathbf{Z} \subseteq \{1, \dots, L\}$
$\mathbf{y}, \mathbf{Y}$                                   observed label set/variable: $\mathbf{y} = \{y\} \cup \mathbf{z}$, $\mathbf{Y} = \{Y\} \cup \mathbf{Z}$
$h(x), h(X)$                                               multiclass classifier mapping $h : \mathcal{X} \mapsto \{1, \dots, L\}$
$\mathcal{L}(h(x), y)$, $\mathcal{L}_A(h(x), \mathbf{y})$  standard and partial 0/1 loss

Table 1: Summary of notation used.

Many formulations of fully-supervised multiclass learning have been proposed based on minimization
of convex upper bounds on risk, usually the expected 0/1 loss (Zhang, 2004):

    0/1 loss: $\mathcal{L}(h(x), y) = \mathbb{1}(h(x) \neq y)$,

where $h(x) : \mathcal{X} \mapsto \{1, \dots, L\}$ is a multiclass classifier.

We cannot evaluate the 0/1 loss using our partially labeled training data. We define a surrogate
loss which we can evaluate, and which we call the ambiguous or partial 0/1 loss (where $A$ stands for
ambiguous):

    Partial 0/1 loss: $\mathcal{L}_A(h(x), \mathbf{y}) = \mathbb{1}(h(x) \notin \mathbf{y})$.
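The two losses differ only in what a prediction is compared against. The following minimal sketch illustrates the definitions; the function names and the example labels are illustrative assumptions, not part of the paper's toolbox.

```python
def zero_one_loss(prediction, true_label):
    """Standard 0/1 loss: 1 if the prediction differs from the true label."""
    return int(prediction != true_label)


def partial_zero_one_loss(prediction, candidate_set):
    """Partial 0/1 loss: 1 only if the prediction falls outside the candidate set."""
    return int(prediction not in candidate_set)


# Hypothetical example: true label 2, observed ambiguity set {2, 5}.
true_label, candidate_set = 2, {2, 5}
for prediction in (2, 5, 7):
    print(prediction,
          zero_one_loss(prediction, true_label),
          partial_zero_one_loss(prediction, candidate_set))
```

Predicting label 5 is an error under the standard loss but goes unnoticed by the partial loss, which is why the partial loss can only underestimate the true loss.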


3.1 Connection Between Partial and Standard 0/1 Losses 


An obvious observation is that the partial loss is an underestimate of the true loss: since $y \in \mathbf{y}$,
a prediction outside $\mathbf{y}$ is necessarily wrong. However, in
the ambiguous learning setting we would like to minimize the true 0/1 loss, with access only to
the partial loss. Therefore we need a way to upper-bound the 0/1 loss using the partial loss. We
first introduce a measure of hardness of learning under ambiguous supervision, which we define as the
ambiguity degree $\varepsilon$ of a distribution $P(X, Y, \mathbf{Z})$:

    Ambiguity degree: $\varepsilon = \sup_{x, y, z:\ P(x, y) > 0,\ z \in \{1, \dots, L\}} P(z \in \mathbf{Z} \mid X = x, Y = y)$.    (1)

In words, $\varepsilon$ corresponds to the maximum probability of an extra label $z$ co-occurring with a
true label $y$, over all labels and inputs. Let us consider several extreme cases: When $\varepsilon = 0$, $\mathbf{Z} = \emptyset$
with probability one, and we are back to the standard supervised learning case, with no ambiguity.
When $\varepsilon = 1$, some extra label always co-occurs with a true label $y$ on an example $x$ and we cannot
tell them apart: no learning is possible for this example. For a fixed ambiguity set size $C$ (i.e.,
$P(|\mathbf{Z}| = C) = 1$), the smallest possible ambiguity degree is $\varepsilon = C/(L-1)$, achieved for the case
where $P(\mathbf{Z} \mid X, Y)$ is uniform over subsets of size $C$, for which we have $P(z \in \mathbf{Z} \mid X, Y) = C/(L-1)$
for all $z \in \{1, \dots, L\} \setminus \{y\}$. Intuitively, the best case scenario for ambiguous learning corresponds to
a distribution with high conditional entropy for $P(\mathbf{Z} \mid X, Y)$.
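The uniform case can be checked by direct enumeration. The sketch below is only an illustration under stated assumptions: it considers a single fixed pair (x, y), so the supremum over (x, y) in Equation 1 is trivial, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def uniform_extra_label_dist(num_labels, set_size, true_label):
    """Uniform distribution over extra-label sets Z of a fixed size, excluding the true label."""
    others = [a for a in range(1, num_labels + 1) if a != true_label]
    subsets = [frozenset(s) for s in combinations(others, set_size)]
    return {s: Fraction(1, len(subsets)) for s in subsets}


def max_cooccurrence(z_dist, num_labels):
    """Max over labels z of P(z in Z) for one fixed pair (x, y)."""
    return max(
        sum((p for s, p in z_dist.items() if z in s), Fraction(0))
        for z in range(1, num_labels + 1)
    )


L, C = 5, 2
eps_uniform = max_cooccurrence(uniform_extra_label_dist(L, C, true_label=1), L)
print(eps_uniform, Fraction(C, L - 1))  # both equal C / (L - 1)

# Degenerate case: label 2 always accompanies true label 1, so the degree is 1.
eps_degenerate = max_cooccurrence({frozenset({2}): Fraction(1)}, L)
print(eps_degenerate)
```

Exact fractions are used so that the comparison with C/(L-1) does not depend on floating-point rounding.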

The following proposition shows we can bound the (unobserved) 0/1 loss by the (observed) 
partial loss, allowing us to approximately minimize the standard loss with access only to the partial 
one. The tightness of the approximation directly relates to the ambiguity degree. 
Proposition 1 (Partial loss bound via ambiguity degree $\varepsilon$) For any classifier $h$ and distribution
$P(X, Y, \mathbf{Z})$, with $\mathbf{Y} = \{Y\} \cup \mathbf{Z}$ and ambiguity degree $\varepsilon$:

$$E_P[\mathcal{L}_A(h(X), \mathbf{Y})] \le E_P[\mathcal{L}(h(X), Y)] \le \frac{1}{1 - \varepsilon} E_P[\mathcal{L}_A(h(X), \mathbf{Y})],$$

with the convention $1/0 = +\infty$. These bounds are tight, and for the second one, for any (rational)
$\varepsilon$, we can find a number of labels $L$, a distribution $P$ and classifier $h$ such that equality holds.


Proof. All proofs appear in Appendix B. □
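The sandwich bound of Proposition 1, partial loss ≤ true loss ≤ partial loss / (1 − ε), can be verified exactly on a small discrete distribution. The toy distribution and classifier below are invented for illustration (they are not from the paper's experiments) and happen to attain the upper bound with equality.

```python
from fractions import Fraction as F

# Hypothetical joint distribution: (probability, x, true label y, extra label set Z).
joint = [
    (F(1, 4), "a", 1, frozenset({2})),
    (F(1, 8), "a", 1, frozenset({3})),
    (F(1, 8), "a", 1, frozenset()),
    (F(1, 4), "b", 2, frozenset({1})),
    (F(1, 4), "b", 2, frozenset()),
]
labels = (1, 2, 3)
h = {"a": 2, "b": 2}  # hypothetical classifier: wrong on "a", right on "b"


def ambiguity_degree(joint, labels):
    """sup over (x, y) with positive mass and over z of P(z in Z | x, y)."""
    eps = F(0)
    for x, y in {(x, y) for _, x, y, _ in joint}:
        mass = sum(p for p, xx, yy, _ in joint if (xx, yy) == (x, y))
        for z in labels:
            co = sum(p for p, xx, yy, zz in joint if (xx, yy) == (x, y) and z in zz)
            eps = max(eps, co / mass)
    return eps


true_loss = sum(p for p, x, y, _ in joint if h[x] != y)
partial_loss = sum(p for p, x, y, z in joint if h[x] not in (z | {y}))
eps = ambiguity_degree(joint, labels)
print(partial_loss, true_loss, partial_loss / (1 - eps))
```

Here the classifier errs on every example with x = "a", but half of those errors are hidden because label 2 appears in the ambiguity set, which is exactly the factor 1/(1 − ε) = 2.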


3.2 Robustness to Outliers 


One potential issue with Proposition 1 is that unlikely (outlier) pairs $x, y$ (with vanishing $P(x, y)$)
might force $\varepsilon$ to be close to 1, making the bound very loose. We show we can refine the notion of
ambiguity degree $\varepsilon$ by excluding such pairs.


Definition 2 ($(\varepsilon, \delta)$-ambiguous distribution) A distribution $P(X, Y, \mathbf{Z})$ is $(\varepsilon, \delta)$-ambiguous if there
exists a subset $G$ of the support of $P(X, Y)$, $G \subseteq \mathcal{X} \times \{1, \dots, L\}$, with probability mass at least $1 - \delta$,
that is, $\int_{(x, y) \in G} P(X = x, Y = y)\, d\mu(x, y) \ge 1 - \delta$, integrated with respect to the appropriate underlying
measure $\mu$ on $\mathcal{X} \times \{1, \dots, L\}$, for which

$$\sup_{(x, y) \in G,\ z \in \{1, \dots, L\}} P(z \in \mathbf{Z} \mid X = x, Y = y) \le \varepsilon.$$


Note that the extreme case $\varepsilon = 0$ corresponds to standard semi-supervised learning, where
a $1 - \delta$ proportion of examples are unambiguously labeled, and $\delta$ are (potentially) fully unlabeled.
Even though we can accommodate it, semi-supervised learning is not our focus in this paper and
our bounds are not well suited for this case.

This definition allows us to bound the 0/1 loss even in the case when some unlikely set of 
pairs $x, y$ with probability $\le \delta$ would make the ambiguity degree large. Suppose we mix an initial
distribution with small ambiguity degree, with an outlier distribution with large overall ambiguity 
degree. The following proposition shows that the bound degrades only by an additive amount, which 
can be interpreted as a form of robustness to outliers. 


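Proposition 3 (Partial loss bound via $(\varepsilon, \delta)$) For any classifier $h$ and $(\varepsilon, \delta)$-ambiguous $P(\mathbf{Z} \mid X, Y)$,

$$E_P[\mathcal{L}(h(X), Y)] \le \frac{1}{1 - \varepsilon} E_P[\mathcal{L}_A(h(X), \mathbf{Y})] + \delta.$$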
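As a worked illustration of how the additive term enters, take $\varepsilon = 0.2$ and $\delta = 0.05$ and suppose, purely as an assumed number rather than a reported result, that the expected partial loss of some classifier is 0.1. The upper bound $\frac{1}{1-\varepsilon} E_P[\mathcal{L}_A] + \delta$ then evaluates as:

```latex
E_P[\mathcal{L}(h(X), Y)]
  \le \frac{1}{1 - 0.2} \cdot 0.1 + 0.05
  = 1.25 \cdot 0.1 + 0.05
  = 0.175 .
```

Since the partial loss never exceeds the true loss, the true expected 0/1 loss in this hypothetical case lies between 0.1 and 0.175.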











A visualization of the bounds in Proposition 1 and Proposition 3 is shown in Figure 4.

Figure 4: Feasible region for expected ambiguous and true loss, for $\varepsilon = 0.2$, $\delta = 0.05$.


3.3 Label-specific Recall Bounds


In the types of data from video experiments, we observe that certain subsets of labels are harder to
disambiguate than others. We can further tighten our bounds between ambiguous loss and standard
0/1 loss if we consider label-specific information. We define the label-specific ambiguity degree $\varepsilon_a$
of a distribution (with $a \in \{1, \dots, L\}$) as:

$$\varepsilon_a = \sup_{x, z:\ P(X = x, Y = a) > 0;\ z \in \{1, \dots, L\}} P(z \in \mathbf{Z} \mid X = x, Y = a).$$


We can show a label-specific analog of Proposition 1: 


Proposition 4 (Label-specific partial loss bound) For any classifier $h$ and distribution $P(X, Y, \mathbf{Z})$
with label-specific ambiguity degree $\varepsilon_a$,

$$E_P[\mathcal{L}(h(X), Y) \mid Y = a] \le \frac{1}{1 - \varepsilon_a} E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid Y = a],$$

where we see that $\varepsilon_a$ bounds per-class recall.

These bounds give a strong connection between ambiguous loss and real loss when $\varepsilon$ is small.
This assumption allows us to approximately minimize the expected real loss by minimizing (an
upper bound on) the ambiguous loss, as we propose in the following section.


4. A Convex Learning Formulation 


We have not assumed any specific form for our classifier $h(x)$ above. We now focus on a particular
family of classifiers, which assigns a score $g_a(x)$ to each label $a$ for a given input $x$ and selects the
highest scoring label:

$$h(x) = \arg\max_{a \in \{1, \dots, L\}} g_a(x).$$

We assume that ties are broken arbitrarily, for example, by selecting the label with smallest index $a$.
We define the vector $g(x) = [g_1(x) \dots g_L(x)]^\top$, with each component $g_a : \mathcal{X} \mapsto \mathbb{R}$ in a function class
$G$. Below, we use a multi-linear function class $G$ by assuming a feature mapping $f(x) : \mathcal{X} \mapsto \mathbb{R}^d$
from inputs to $d$ real-valued features and let $g_a(x) = w_a \cdot f(x)$, where $w_a \in \mathbb{R}^d$ is a weight vector for
each class, bounded by some norm: $\|w_a\|_p \le B$ for $p = 1, 2$.
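For the linear function class, prediction is a matrix-vector product followed by an argmax. The sketch below assumes NumPy, stacks the weight vectors as rows of a hypothetical matrix `W`, and uses 0-based label indices; the weights and features are made-up values.

```python
import numpy as np


def predict(W, features):
    """Return (label, scores) with scores[a] = w_a . f(x); labels are 0-based here."""
    scores = W @ features
    # np.argmax returns the smallest index among ties, matching the tie-breaking rule.
    return int(np.argmax(scores)), scores


W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # one weight vector per label (L = 3, d = 2)
features = np.array([0.5, -1.0])  # f(x) for a single input
label, scores = predict(W, features)
print(label, scores)
```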

We build our learning formulation on a simple and general multiclass scheme, frequently used
for the fully supervised setting (Crammer and Singer, 2002; Rifkin and Klautau, 2004; Zhang,
2004; Tewari and Bartlett, 2005), that combines convex binary losses $\psi(\cdot) : \mathbb{R} \mapsto \mathbb{R}_+$ on individual
components of $g$ to create a multiclass loss. For example, we can use hinge, exponential or logistic
loss. In particular, we assume a type of one-against-all scheme for the supervised case:

$$\mathcal{L}_\psi(g(x), y) = \psi(g_y(x)) + \sum_{a \neq y} \psi(-g_a(x)).$$

A classifier $h_g(x)$ is selected by minimizing the empirical loss $\mathcal{L}_\psi$ on the sample $S = \{(x_i, y_i)\}_{i=1}^m$
(called empirical $\psi$-risk) over the function class $G$:

$$\inf_{g \in G} E_S[\mathcal{L}_\psi(g(X), Y)] = \inf_{g \in G} \frac{1}{m} \sum_{i=1}^m \mathcal{L}_\psi(g(x_i), y_i).$$

For the fully supervised case, under appropriate assumptions, this form of the multiclass loss
is infinite-sample consistent. This means that a minimizer $\hat{g}$ of $\psi$-risk achieves optimal 0/1 risk,
$\inf_g E_S[\mathcal{L}_\psi(g(X), Y)] = \inf_g E_P[\mathcal{L}(g(X), Y)]$, as the number of samples $m$ grows to infinity, provided
that the function class $G$ grows appropriately fast with $m$ to be able to approximate any function
from $\mathcal{X}$ to $\mathbb{R}$ and $\psi(u)$ satisfies the following conditions: (1) $\psi(u)$ is convex, (2) bounded below, (3)
differentiable and (4) $\psi(u) < \psi(-u)$ when $u > 0$ (Theorem 9 in Zhang (2004)). These conditions
are satisfied, for example, for the exponential, logistic and squared hinge loss $\max(0, 1 - u)^2$. Below,
we construct a loss function for the partially labeled case and consider when the proposed loss is
consistent.


4.1 Convex Loss for Partial Labels 


In the partially labeled setting, instead of an unambiguous single label $y$ per instance we have a set
of labels $\mathbf{y}$, one of which is the correct label for the instance. We propose the following loss, which
we call our Convex Loss for Partial Labels (CLPL):

$$\mathcal{L}_\psi(g(x), \mathbf{y}) = \psi\Big(\frac{1}{|\mathbf{y}|} \sum_{a \in \mathbf{y}} g_a(x)\Big) + \sum_{a \notin \mathbf{y}} \psi(-g_a(x)). \tag{2}$$

Note that if $\mathbf{y}$ is a singleton, the CLPL function reduces to the regular multiclass loss. Otherwise,
CLPL will drive up the average of the scores of the labels in $\mathbf{y}$. If the score of the correct label is
large enough, the other labels in the set do not need to be positive. This tendency alone does not
guarantee that the correct label has the highest score. However, we show in Proposition 6 that
$\mathcal{L}_\psi(g(x), \mathbf{y})$ upper-bounds $\mathcal{L}_A(g(x), \mathbf{y})$ whenever $\psi(\cdot)$ is an upper bound on the 0/1 loss.

Of course, minimizing an upper bound on the loss does not always lead to sensible algorithms.
We show next that our loss function is consistent under certain assumptions and offers a tighter
upper bound to the ambiguous loss compared to a more straightforward multi-label approach.
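A minimal sketch of the CLPL loss for a single example, written with NumPy and the squared hinge as ψ, is given below. It is an illustration under stated assumptions rather than the toolbox implementation: labels are 0-based, the function names are hypothetical, and the scores are made-up values.

```python
import numpy as np


def squared_hinge(u):
    """psi(u) = max(0, 1 - u)^2."""
    return np.maximum(0.0, 1.0 - u) ** 2


def one_vs_all_loss(scores, label, psi=squared_hinge):
    """Supervised loss: psi(g_y) + sum over a != y of psi(-g_a)."""
    scores = np.asarray(scores, dtype=float)
    others = np.delete(scores, label)
    return float(psi(scores[label]) + psi(-others).sum())


def clpl_loss(scores, candidates, psi=squared_hinge):
    """CLPL: psi(mean score over the candidate set) + sum over a not in set of psi(-g_a)."""
    scores = np.asarray(scores, dtype=float)
    in_set = np.zeros(scores.shape[0], dtype=bool)
    in_set[list(candidates)] = True
    return float(psi(scores[in_set].mean()) + psi(-scores[~in_set]).sum())


scores = [2.0, -1.0, -1.5]
print(clpl_loss(scores, {0, 1}))                           # average score 0.5 over the set
print(clpl_loss(scores, {0}), one_vs_all_loss(scores, 0))  # singleton set: the two agree
```

With a singleton candidate set the average is just the score of that label, so the function coincides with the supervised one-against-all loss.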


4.2 Consistency for Partial Labels 


We derive conditions under which the minimizer of the CLPL in Equation 2 with partial labels
achieves optimal 0/1 risk: $\inf_{g \in G} E_S[\mathcal{L}_\psi(g(X), \mathbf{Y})] = \inf_{g \in G} E_P[\mathcal{L}(g(X), Y)]$ in the limit of infinite
data and arbitrarily rich $G$. Not surprisingly, our loss function is not consistent without making some
additional assumptions on $P(\mathbf{Y} \mid X)$ beyond the assumptions for the fully supervised case. Note that
the Bayes optimal classifier for 0/1 loss satisfies the condition $h(x) \in \arg\max_a P(Y = a \mid X = x)$, and
may not be unique. First, we require that $\arg\max_a P(Y = a \mid X = x) = \arg\max_a P(a \in \mathbf{Y} \mid X = x)$,
since otherwise $\arg\max_a P(Y = a \mid X = x)$ cannot be determined by any algorithm from partial labels
$\mathbf{Y}$ without additional information even with an infinite amount of data. Second, we require a simple
dominance condition as detailed below and provide a counterexample when this condition does not
hold. The dominance relation defined formally below states that when $a$ is the most (or one of the
most) likely label given $x$ according to $P(Y \mid X = x)$ and $b$ is not, $\mathbf{c} \cup \{a\}$ has higher (or equal)
probability than $\mathbf{c} \cup \{b\}$ for any set of other labels $\mathbf{c}$.


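Proposition 5 (Partial label consistency) Suppose the following conditions hold:

• $\psi(\cdot)$ is differentiable, convex, lower-bounded and non-increasing, with $\psi'(0) < 0$.

• When $P(X = x) > 0$, $\arg\max_{a'} P(Y = a' \mid X = x) = \arg\max_{a'} P(a' \in \mathbf{Y} \mid X = x)$.

• The following dominance relation holds: $\forall a \in \arg\max_{a'} P(a' \in \mathbf{Y} \mid X = x)$, $\forall b \notin \arg\max_{a'}
P(a' \in \mathbf{Y} \mid X = x)$, $\forall \mathbf{c} \subset \{1, \dots, L\} \setminus \{a, b\}$:

$$P(\mathbf{Y} = \mathbf{c} \cup \{a\} \mid X = x) \ge P(\mathbf{Y} = \mathbf{c} \cup \{b\} \mid X = x).$$

Then $\mathcal{L}_\psi(g(x), \mathbf{y})$ is infinite-sample consistent:

$$\inf_{g \in G} E_S[\mathcal{L}_\psi(g(X), \mathbf{Y})] = \inf_{g \in G} E_P[\mathcal{L}(g(X), Y)],$$

as $|S| = m \to \infty$ and $G \to \mathbb{R}^L$. As a corollary, consistency is implied when ambiguity degree $\varepsilon < 1$
and $P(Y \mid X)$ is deterministic, that is, $P(Y \mid X) = \mathbb{1}(Y = h(X))$ for some $h(\cdot)$.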


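If the dominance relation does not hold, we can find counter-examples where consistency fails.
Consider a distribution with a single $x$ with $P(x) > 0$, and let $L = 4$, $P(|\mathbf{Y}| = 2 \mid X = x) = 1$, $\psi$ be
the square-hinge loss, and $P(\mathbf{Y} \mid X = x)$ be such that:

    250·P_ab |  a=1  a=2  a=3  a=4
    ---------+--------------------
    b=1      |    0   29   44    0
    b=2      |   29    0   17   26
    b=3      |   44   17    0    9
    b=4      |    0   26    9    0
    ---------+--------------------
    250·P_a  |   73   72   70   35

Above, the abbreviations are $P_{ab} = P(\mathbf{Y} = \{a, b\} \mid X = x)$ and $P_a = \sum_b P_{ab}$. The entries that do not
satisfy the dominance relation are those involving label 4: $P_{14} = 0$ is smaller than both $P_{24}$ and $P_{34}$,
although label 1 maximizes $P_a$. We can explicitly compute the minimizer of $\mathcal{L}_\psi$, which
is $g = \big(\tfrac{1}{2} P_{ab} + \mathrm{diag}(2 - \tfrac{3}{2} P_a)\big)^{-1} (3 P_a - 2) \approx -[\,0.6572\ \ 0.6571\ \ 0.6736\ \ 0.8568\,]^\top$. It satisfies
$\arg\max_a g_a = 2$ but $\arg\max_a \sum_b P_{ab} = 1$.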
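The counterexample can be reproduced numerically. The sketch below assumes NumPy and assumes that the closed form of the minimizer reads g = (½ P_ab + diag(2 − (3/2) P_a))⁻¹ (3 P_a − 2), with the table entries divided by 250; it should be read as a check of that expression rather than as a general solver.

```python
import numpy as np

# Table entries 250 * P_ab for the four labels (rows b, columns a).
counts = np.array([[0, 29, 44, 0],
                   [29, 0, 17, 26],
                   [44, 17, 0, 9],
                   [0, 26, 9, 0]], dtype=float)
P_ab = counts / 250.0
P_a = P_ab.sum(axis=1)

# Linear system assumed from the closed form quoted in the lead-in.
A = 0.5 * P_ab + np.diag(2.0 - 1.5 * P_a)
g = np.linalg.solve(A, 3.0 * P_a - 2.0)

print(g)
print("argmax of scores (1-based):", int(np.argmax(g)) + 1)
print("argmax of P_a (1-based):", int(np.argmax(P_a)) + 1)
```

The scores of labels 1 and 2 differ only in the fourth decimal place, but that gap is enough for the minimizer to prefer label 2 while label 1 is the most frequent candidate.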





4.3 Comparison to Other Loss Functions 


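The “naive” partial loss, proposed by Jin and Ghahramani (2002), treats each example as having
multiple correct labels, which implies the following loss function:

$$\mathcal{L}_\psi^{\mathrm{naive}}(g(x), \mathbf{y}) = \frac{1}{|\mathbf{y}|} \sum_{a \in \mathbf{y}} \psi(g_a(x)) + \sum_{a \notin \mathbf{y}} \psi(-g_a(x)). \tag{3}$$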
Figure 5: Our loss function in Equation 2 provides a tighter convex upper bound than the naive loss
Equation 3 on the non-convex max-loss Equation 4. (Left) We show the square hinge
$\psi$ (blue) and a chord (red) touching two points $g_1, g_2$. The horizontal lines correspond
to our loss $\psi(\tfrac{1}{2}(g_1 + g_2))$ Equation 2, the max-loss $\psi(\max(g_1, g_2))$, and the naive loss
$\tfrac{1}{2}(\psi(g_1) + \psi(g_2))$ (ignoring negative terms and assuming $\mathbf{y} = \{1, 2\}$). (Middle) Corresponding
losses as we vary $g_1 \in [-2, 2]$ (with $g_2 = 0$). (Right) Same, with $g_2 = -g_1$.


One reason we expect our loss function to outperform the naive approach is that we obtain a tighter
convex upper bound on L_A. Let us also define


    L_\psi^{max}(g(x), y) = \psi\big(\max_{a \in y} g_a(x)\big) + \sum_{a \notin y} \psi(-g_a(x)),    (4)

which is not convex, but is in some sense closer to the desired true loss. The following inequalities
are verified for common losses ψ such as square hinge loss, exponential loss, and log loss with
proper scaling:


Proposition 6 (Comparison between partial losses) Under the usual conditions that ψ is a convex,
decreasing upper bound of the step function, the following inequalities hold:


    2 L_A \le L_\psi^{max} \le L_\psi \le L_\psi^{naive}.

The 2nd and 3rd bounds are tight, and the first one is tight provided ψ(0) = 1 and lim_{+∞} ψ = 0.


This shows that our CLPL L_ψ is a tighter approximation to L_A than L_ψ^{naive}, as illustrated in
Figure 5. To gain additional intuition as to why CLPL is better than the naive loss Equation 3: for
an input x with ambiguous label set (a,b), CLPL only encourages the average of g_a(x) and g_b(x)
to be large, allowing the correct score to be positive and the extraneous score to be negative (e.g.,
g_a(x) = 2, g_b(x) = −1). In contrast, the naive model encourages both g_a(x) and g_b(x) to be large.
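The ordering of the three losses can be illustrated on the scores from the example above. This is a minimal sketch, assuming the square hinge for ψ and a third label c outside the candidate set (an assumption added for illustration); the function names are ours, not taken from any released implementation.

```python
def sq_hinge(u):
    """Square hinge psi(u) = max(0, 1 - u)^2."""
    return max(0.0, 1.0 - u) ** 2


def outside_terms(g, y, psi):
    """Sum of psi(-g_a) over labels a that are not in the candidate set y."""
    return sum(psi(-g[a]) for a in g if a not in y)


def clpl_loss(g, y, psi=sq_hinge):
    """Equation 2: psi of the average score over the candidate set."""
    return psi(sum(g[a] for a in y) / len(y)) + outside_terms(g, y, psi)


def naive_loss(g, y, psi=sq_hinge):
    """Equation 3: average of psi over the candidate set."""
    return sum(psi(g[a]) for a in y) / len(y) + outside_terms(g, y, psi)


def max_loss(g, y, psi=sq_hinge):
    """Equation 4: psi of the best score in the candidate set (not convex)."""
    return psi(max(g[a] for a in y)) + outside_terms(g, y, psi)


g = {"a": 2.0, "b": -1.0, "c": -1.0}   # c is a hypothetical third label
y = {"a", "b"}
print(max_loss(g, y), clpl_loss(g, y), naive_loss(g, y))
```

With these scores the naive loss is dominated by the penalty on the extraneous label b, whereas CLPL only penalizes the average of the two candidate scores.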


4.4 Generalization Bounds 


To derive concrete generalization bounds on multiclass error for CLPL we define our function class
for g. We assume a feature mapping f(x) : X ↦ R^d from inputs to d real-valued features and let
g_a(x) = w_a · f(x), where w_a ∈ R^d is a weight vector for each class, bounded in L2 norm: ||w_a||_2 ≤ B.
We use ψ(u) = max(0, 1 − u)^p (for example hinge loss with p = 1, squared hinge loss with p = 2).
The corresponding margin-based loss is defined via a truncated, rescaled version of ψ:


    \psi_\gamma(u) = \begin{cases} 1 & \text{if } u \le 0, \\ (1 - u/\gamma)^p & \text{if } 0 < u \le \gamma, \\ 0 & \text{if } u > \gamma. \end{cases}
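The truncated loss is straightforward to write down. The following is a small sketch of ψ and ψ_γ as defined above; the function names and the default p = 2 are illustrative choices.

```python
def psi(u, p=2):
    """psi(u) = max(0, 1 - u)^p (hinge for p = 1, squared hinge for p = 2)."""
    return max(0.0, 1.0 - u) ** p


def psi_gamma(u, gamma, p=2):
    """Margin-based loss: 1 for u <= 0, (1 - u/gamma)^p on (0, gamma], 0 beyond gamma."""
    if u <= 0:
        return 1.0
    if u <= gamma:
        return (1.0 - u / gamma) ** p
    return 0.0


print(psi_gamma(-1.0, 0.5), psi_gamma(0.25, 0.5), psi_gamma(1.0, 0.5))
```

For γ = 1 the two functions coincide on [0, 1]; for u ≤ 0 the truncated version is capped at 1, so it stays within [0, 1] everywhere.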


Proposition 7 (Generalization bound) For any integer m and any η ∈ (0,1), with probability at
least 1 − η over samples S = {(x_i, y_i)}_{i=1}^m, for every g in G:


    [displayed inequality not recoverable from the extracted text]


















































where c is an absolute constant from Lemma 12 in the appendix, E_S is the sample average and L is
the number of labels.





The proof in the appendix uses Definition 11 for Rademacher and Gaussian complexity, Lemma
12, Theorem 13 and Theorem 14 from Bartlett and Mendelson (2002), reproduced in the appendix
and adapted to our notation for completeness. Using Proposition 7 and Proposition 1, we can derive
the following bounds on the true expected 0/1 loss E_P[L(g(X), Y)] from purely ambiguous data:














Proposition 8 (Generalization bounds on true loss) For any ε-ambiguous distribution P, integer m
and η ∈ (0,1), with probability at least 1 − η over samples S = {(x_i, y_i)}_{i=1}^m, for
every g ∈ G:


    [displayed inequality not recoverable from the extracted text]


5. Transductive Analysis 









































We now turn to the analysis of our Convex Loss for Partial Labels (CLPL) in the transductive setting. 
We show guarantees on disambiguating the labels of instances under fairly reasonable assumptions. 


Example 1 Consider a data set S of two points, x,x', with label sets {1,2},{1,3}, respectively and 
suppose that the total number of labels is 3. The objective function is given by: 





    \psi\big(\tfrac{1}{2}(g_1(x) + g_2(x))\big) + \psi(-g_3(x)) + \psi\big(\tfrac{1}{2}(g_1(x') + g_3(x'))\big) + \psi(-g_2(x')).


Suppose the correct labels are (1,1). It is clear that without further assumptions about x and x' 
we cannot assume that the minimizer of the loss above will predict the right label. However, if f(x) 
and f(x’) are close, it should be intuitively clear that we should be able to deduce the label of the 
two examples is 1. 


A natural question is under what conditions on the data CLPL will produce a labeling that is
consistent with the ground truth. We provide an analysis under several assumptions.


5.1 Definitions


In the remainder of this section, we denote by y(x) (resp. \mathbf{y}(x)) the true label (resp. ambiguous
label set) of some x ∈ S, and z(x) = \mathbf{y}(x) \ {y(x)}. ||·|| denotes an arbitrary norm, with ||·||_* its
dual norm. As above, ψ denotes a decreasing upper bound on the step function and g a classifier
satisfying: ∀a, ||w_a||_* ≤ 1 (we can easily generalize the remaining propositions to the case where
g_a is 1-Lipschitz and f is the identity). For x ∈ S and η > 0, we define B_η(x) as the set of neighbors
of x that have the same label as x:


    B_\eta(x) = \{ x' \in S \setminus \{x\} : \|f(x') - f(x)\| < \eta, \; y(x') = y(x) \}.


Lemma 9 Let x ∈ S. If L_ψ(g(x), \mathbf{y}(x)) ≤ ψ(η/2) and ∀a ∈ z(x), ∃x' ∈ B_η(x) such that g_a(x') ≤
−η/2, then g predicts the correct label for x.





In other words, g predicts the correct label for x when its loss is sufficiently small, and for each of
its ambiguous labels a, we can find a neighbor with the same label whose score g_a(x') is small enough.
Note that this does not make any assumption on the nearest neighbors of x.





Corollary 10 Let x ∈ S. Suppose ∃q ≥ 0, x_1 ... x_q ∈ B_η(x) such that ∩_{i=0..q} z(x_i) = ∅,
max_{i=0..q} L_ψ(g(x_i), \mathbf{y}(x_i)) ≤ ψ(η/2) (with x_0 := x). Then g predicts the correct label for x.


In other words, g predicts the correct label for x if we can find a set of neighbors of the same label 
with small enough loss, and without any common extra label. This simple condition often arises in 
our experiments. 
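The condition of Corollary 10 can be tested on a labeled data set, for analysis purposes only, since it uses the true labels. The sketch below assumes the Euclidean norm (which is its own dual), the square hinge for ψ, and scores given as dictionaries; all names are hypothetical. Because adding neighbors can only shrink the intersection of extra labels, it is enough to intersect over every same-label neighbor within distance η whose loss is small enough.

```python
import math


def sq_hinge(u):
    return max(0.0, 1.0 - u) ** 2


def clpl_loss(scores, bag):
    """CLPL loss for one example: scores maps label -> g_a(x), bag is the candidate set."""
    mean_in = sum(scores[a] for a in bag) / len(bag)
    return sq_hinge(mean_in) + sum(sq_hinge(-s) for a, s in scores.items() if a not in bag)


def corollary10_holds(i, feats, bags, truth, scores, eta):
    """Check the sufficient condition of Corollary 10 for example i.

    The constraint ||w_a||_* <= 1 on the classifier is assumed, not verified.
    """
    thresh = sq_hinge(eta / 2)
    if clpl_loss(scores[i], bags[i]) > thresh:
        return False
    common_extra = set(bags[i]) - {truth[i]}
    for j in range(len(feats)):
        if j == i or truth[j] != truth[i]:
            continue
        if math.dist(feats[i], feats[j]) >= eta:
            continue
        if clpl_loss(scores[j], bags[j]) <= thresh:
            common_extra &= set(bags[j]) - {truth[j]}
    return not common_extra


# Two nearby points as in Example 1: candidate sets {1,2} and {1,3}, true label 1.
feats = [[0.0], [0.1]]
truth = [1, 1]
scores = [{1: 3.0, 2: -1.0, 3: -1.0}, {1: 3.0, 2: -1.0, 3: -1.0}]
print(corollary10_holds(0, feats, [{1, 2}, {1, 3}], truth, scores, eta=0.5))
print(corollary10_holds(0, feats, [{1, 2}, {1, 2}], truth, scores, eta=0.5))
```

In the second call both points share the extra label 2, so the intersection is not empty and the condition fails.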


6. Algorithms 


Our formulation is quite flexible and we can derive many alternative algorithms depending on the 
choice of the binary loss y(u), the regularization, and the optimization method. We can minimize 
Equation 2 using off-the-shelf binary classification solvers. To do this, we rewrite the two types of 
terms in Equation 2 as linear combinations of m·L feature vectors. We stack the parameters and
features into one vector as follows, so that g_a(x) = w_a · f(x) = w · f(x,a):


    w = \begin{bmatrix} w_1 \\ \vdots \\ w_L \end{bmatrix}, \qquad f(x,a) = \begin{bmatrix} 1(a = 1) f(x) \\ \vdots \\ 1(a = L) f(x) \end{bmatrix}.


We also define f(x,y) to be the average feature vector of the labels in the set y:

    f(x,y) = \frac{1}{|y|} \sum_{a \in y} f(x,a).


With these definitions, we have:


    L_\psi(g(x), y) = \psi(w \cdot f(x,y)) + \sum_{a \notin y} \psi(-w \cdot f(x,a)).


Then to use a binary classification method to solve CLPL optimization, we simply transform the
m partially labelled training examples S = {x_i, y_i}_{i=1}^m into m positive examples S_+ = {f(x_i, y_i)}_{i=1}^m
and Σ_i (L − |y_i|) negative examples S_− = {f(x_i, a)}_{i=1, a∉y_i}^m. Note that the increase in dimension of the
features by a factor of L does not significantly affect the running time of most methods since the
vectors are sparse. We use the off-the-shelf implementation of binary SVM with squared hinge (Fan
et al., 2008) in most of our experiments, where the objective is:


    \min_w \; \frac{1}{2}\|w\|_2^2 + C \sum_i \max(0, 1 - w \cdot f(x_i, y_i))^2 + C \sum_{i, a \notin y_i} \max(0, 1 + w \cdot f(x_i, a))^2.
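The reduction to binary classification can be sketched as follows. This is an illustration only: labels are assumed to be 0-based integers, features are dense Python lists rather than the sparse vectors mentioned above, and the function names are hypothetical rather than part of any solver's interface.

```python
def joint_feature(fx, a, L):
    """f(x, a): copy f(x) into block a of a zero vector of length L * d."""
    d = len(fx)
    out = [0.0] * (L * d)
    out[a * d:(a + 1) * d] = fx
    return out


def bag_feature(fx, bag, L):
    """f(x, y): average of f(x, a) over the labels a in the candidate set."""
    out = [0.0] * (L * len(fx))
    for a in bag:
        for k, v in enumerate(joint_feature(fx, a, L)):
            out[k] += v / len(bag)
    return out


def to_binary(X, bags, L):
    """One positive example per instance, one negative per label outside its bag."""
    examples = []
    for fx, bag in zip(X, bags):
        examples.append((bag_feature(fx, bag, L), +1))
        for a in range(L):
            if a not in bag:
                examples.append((joint_feature(fx, a, L), -1))
    return examples


def objective(w, examples, C):
    """0.5 * ||w||^2 + C * sum of squared hinge losses on signed margins."""
    reg = 0.5 * sum(v * v for v in w)
    loss = sum(max(0.0, 1.0 - s * sum(wi * zi for wi, zi in zip(w, z))) ** 2
               for z, s in examples)
    return reg + C * loss


X = [[1.0, 2.0], [0.5, -1.0]]
bags = [{0, 1}, {2}]
examples = to_binary(X, bags, L=3)
print(len(examples), objective([0.0] * 6, examples, C=1.0))
```

Any binary solver that minimizes an L2-regularized squared hinge objective over `examples` would then return the stacked weight vector w.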


Using hinge loss and L1 regularization leads to a linear programming formulation, and using L1
with exponential loss leads naturally to a boosting algorithm. We present (and experiment with)
a boosting variant of the algorithm, allowing efficient feature selection, as described in Appendix
A. We can also consider the case where the regularization is L2 and f(x) : X ↦ R^d is a nonlinear
mapping to a high, possibly infinite dimensional space using kernels. In that case, it is simple to
show that

    w = \sum_i \alpha_i f(x_i, y_i) - \sum_{i, a \notin y_i} \alpha_{i,a} f(x_i, a),


for some set of non-negative α's, where α_i corresponds to the positive example f(x_i, y_i), and α_{i,a}
corresponds to the negative example f(x_i, a), for a ∉ y_i. Letting K(x,x') = f(x) · f(x') be the kernel
function, note that f(x,a) · f(x',b) = 1(a = b) K(x,x'). Hence, we have:


    w \cdot f(x,b) = \sum_{i, a \in y_i} \frac{\alpha_i 1(a = b)}{|y_i|} K(x_i, x) - \sum_{i, a \notin y_i} \alpha_{i,a} 1(a = b) K(x_i, x).


This transformation allows us to use kernels with standard off-the-shelf binary SVM implementa- 
tions. 
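The kernel expansion can be checked against the explicit stacked representation. The sketch below uses a linear kernel, 0-based labels and arbitrary non-negative coefficients chosen for illustration; the names `alpha_pos` and `alpha_neg` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def joint_feature(fx, a, L):
    """f(x, a): f(x) placed in block a of a zero vector of length L * d."""
    d = len(fx)
    out = [0.0] * (L * d)
    out[a * d:(a + 1) * d] = fx
    return out


def bag_feature(fx, bag, L):
    """f(x, y): average of f(x, a) over a in the candidate set."""
    out = [0.0] * (L * len(fx))
    for a in bag:
        for k, v in enumerate(joint_feature(fx, a, L)):
            out[k] += v / len(bag)
    return out


def kernel_score(x, b, X, bags, alpha_pos, alpha_neg, K=dot):
    """w . f(x, b) computed from kernel evaluations only."""
    total = 0.0
    for i, (xi, bag) in enumerate(zip(X, bags)):
        if b in bag:
            total += alpha_pos[i] / len(bag) * K(xi, x)
        else:
            total -= alpha_neg[(i, b)] * K(xi, x)
    return total


X = [[1.0, 2.0], [0.5, -1.0]]
bags = [{0, 1}, {2}]
L = 3
alpha_pos = [0.3, 0.7]                               # one per positive example
alpha_neg = {(0, 2): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # one per (i, a) with a not in y_i

# Explicit stacked weight vector w for comparison.
w = [0.0] * (L * 2)
for i, (xi, bag) in enumerate(zip(X, bags)):
    for k, v in enumerate(bag_feature(xi, bag, L)):
        w[k] += alpha_pos[i] * v
    for a in range(L):
        if a not in bag:
            for k, v in enumerate(joint_feature(xi, a, L)):
                w[k] -= alpha_neg[(i, a)] * v

x = [2.0, 1.0]
explicit = [dot(w, joint_feature(x, b, L)) for b in range(L)]
kernel = [kernel_score(x, b, X, bags, alpha_pos, alpha_neg) for b in range(L)]
print(explicit, kernel)
```

Replacing `dot` by another positive definite kernel keeps `kernel_score` valid even when the explicit vector w cannot be formed.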


7. Controlled Partial Labeling Experiments 


We first perform a series of controlled experiments to analyze our Convex Learning from Partial La- 
bels (CLPL) framework on several data sets, including standard benchmarks from the UCI repos- 
itory (Asuncion and Newman, 2007), a speaker identification task from audio extracted from 
movies, and a face naming task from Labeled Faces in the Wild (Huang et al., 2007b). In Section 
8 we also consider the challenging task of naming characters in TV shows throughout an entire 
season. In each case the goal is to correctly label faces/speech segments/instances from examples 
that have multiple potential labels (transductive case), as well as learn a model that can generalize 
to other unlabeled examples (inductive case). 

We analyze the effect on learning of the following factors: distribution of ambiguous labels, 
size of ambiguous bags, proportion of instances which contain an ambiguous bag, entropy of the 
ambiguity, distribution of true labels and number of distinct labels. We compare our CLPL approach 
against a number of baselines, including a generative model, a discriminative maximum-entropy 
model, a naive model, two K-nearest neighbor models, as well as models that ignore the ambiguous 
bags. We also propose and compare several variations on our cost function. We conclude with 
a comparative summary, analyzing our approach and the baselines according to several criteria: 
accuracy, applicability, space/time complexity and running time.


7.1 Baselines


In the experiments, we compare CLPL with the following baselines. 


7.1.1 CHANCE BASELINE 


We define the chance baseline as randomly guessing between the possible ambiguous labels only.
Defining the (empirical) average ambiguous size to be E_S[|y|] = (1/m) Σ_{i=1}^m |y_i|, the expected error
from the chance baseline is given by

    error_{chance} = 1 - \frac{1}{E_S[|y|]}.
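As a quick illustration, the chance baseline can be computed directly from the candidate sets. The sketch below also computes, for comparison, the per-example average 1 − (1/m) Σ_i 1/|y_i|, which is an addition of ours and not part of the definition above; the two quantities agree whenever all candidate sets have the same size.

```python
def chance_error(bags):
    """Chance baseline as defined above: 1 - 1 / (average candidate-set size)."""
    mean_size = sum(len(b) for b in bags) / len(bags)
    return 1.0 - 1.0 / mean_size


def per_example_chance_error(bags):
    """Comparison quantity (our addition): average of 1 - 1/|y_i| over examples."""
    return 1.0 - sum(1.0 / len(b) for b in bags) / len(bags)


equal_bags = [{1, 2}, {1, 3}, {2, 3}]
mixed_bags = [{1}, {1, 2, 3}]
print(chance_error(equal_bags), per_example_chance_error(equal_bags))
print(chance_error(mixed_bags), per_example_chance_error(mixed_bags))
```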














7.1.2 NAIVE MODEL 


We report results on an un-normalized version of the naive model introduced in Equation 3:
Σ_{a∈y} ψ(g_a(x)) + Σ_{a∉y} ψ(−g_a(x)), but both normalized and un-normalized versions produce very
similar results. After training, we predict the label with the highest score (in the transductive
setting): ŷ = arg max_{a∈y} g_a(x).


7.1.3 IBM MODEL 1 


This generative model was originally proposed in Brown et al. (1993) for machine translation, but
we can adapt it to the ambiguous label case. In our setting, the conditional probability of observing
example x ∈ R^d given that its label is a is Gaussian: x ∼ N(μ_a, Σ_a). We use the
expectation-maximization (EM) algorithm to learn the parameters of the Gaussians (mean μ_a and diagonal
covariance matrix Σ_a = diag(σ_a) for each label).


7.1.4 DISCRIMINATIVE EM 


We compare with the model proposed in Jin and Ghahramani (2002), which is a discriminative
model with an EM procedure adapted for the ambiguous label setting. The authors minimize the
KL divergence between a maximum entropy model P (estimated in the M-step) and a distribution
over ambiguous labels P̂ (estimated in the E-step):


    J(\theta, \hat{P}) = \sum_i \sum_{a \in y_i} \hat{P}(a \mid x_i) \log \frac{\hat{P}(a \mid x_i)}{P(a \mid x_i, \theta)}.


7.1.5 K-NEAREST NEIGHBOR 


Following Hullermeier and Beringer (2006), we adapt the k-Nearest Neighbor Classifier to the 
ambiguous label setting as follows: 


    knn(x) = \arg\max_{a \in y} \sum_{i=1}^{k} w_i 1(a \in y_i),    (5)


where x_i is the i-th nearest neighbor of x using Euclidean distance, and w_i are a set of weights. We
use two KNN baselines: KNN assumes uniform weights w_i = 1 (model used in Hullermeier and
Beringer, 2006), and weighted KNN uses linearly decreasing weights w_i = k − i + 1. We use k = 5
and break ties randomly as in Hullermeier and Beringer (2006).
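Equation 5 can be sketched in a few lines. The function below is an illustration under stated assumptions: the caller passes the reference set without the query point itself, distances are Euclidean, and the function name and arguments are hypothetical.

```python
import math
import random


def knn_predict(x, bag, X, bags, k=5, weighted=False, rng=random):
    """Vote among the labels in `bag` using the candidate sets of the k nearest neighbors."""
    order = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:k]
    votes = {a: 0.0 for a in bag}
    for rank, i in enumerate(order, start=1):
        weight = (k - rank + 1) if weighted else 1   # w_i = k - i + 1 or w_i = 1
        for a in bag:
            if a in bags[i]:
                votes[a] += weight
    top = max(votes.values())
    # Ties are broken at random among the top-scoring labels.
    return rng.choice(sorted(a for a, v in votes.items() if v == top))


X = [[0.0], [0.1], [0.2], [5.0]]
bags = [{1, 2}, {1, 3}, {1, 2}, {2, 3}]
print(knn_predict([0.05], {1, 2}, X, bags, k=3))
print(knn_predict([0.05], {1, 2}, X, bags, k=3, weighted=True))
```

In this toy example label 1 appears in the candidate sets of all three nearest neighbors, so both the uniform and the weighted variant select it.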
7.1.6 SUPERVISED MODELS


Finally we also consider two baselines that ignore the ambiguous label setting. The first one, de- 
noted as supervised model, removes from Equation 3 the examples with |y| > 1. The second model, 
denoted as supervised KNN, removes from Equation 5 the same examples. 


7.2 Data Sets and Feature Description 


We describe below the different data sets used to report our experiments. The experiments for 
automatic naming of characters in TV shows can be found in Section 8. A concise summary is 
given in Table 2. 





























Data Set | # instances (m) | # features (d) | # labels (L) | prediction task
UCI: dermatology | 366 | 34 | 6 | disease diagnostic
UCI: ecoli | 336 | 8 | 8 | site prediction
UCI: abalone | 4177 | 8 | 29 | age determination
FIW(10b) | 500 | 50 | 10 (balanced) | face recognition
FIW(10) | 1456 | 50 | 10 | face recognition
FIW(100) | 3011 | 50 | 100 | face recognition
Lost audio | 522 | 50 | 19 | speaker id
TV+movies | 10,000 | 50 | 100 | face recognition























Table 2: Summary of data sets used in our experiments. The TV+movies experiments are treated
         in Section 8. FIW(10b) uses a balanced distribution of labels (first 50 images
         for the top 10 most frequent people).


7.2.1 UCI DATA SETS 


We selected three biology related data sets from the publicly available UCI repository (Asuncion 
and Newman, 2007): dermatology, ecoli, abalone. As a preprocessing step, each feature was inde- 
pendently scaled to have zero mean and unit variance. 


7.2.2 FACES IN THE WILD (FIW) 


We experiment with different subsets of the publicly available Labeled Faces in the Wild (Huang 
et al., 2007a) data set. We use the images registered with funneling (Huang et al., 2007a), and crop 
out the central part corresponding to the approximate face location, which we resize to 60x90. We 
project the resulting grayscale patches (treated as 5400x1 vectors) onto a 50-dimensional subspace
using PCA.² In Table 2, FIW(10b) extracts the first 50 images for each of the top 10 most frequent
people (balanced label distribution); FIW(10) extracts all images for each of the top 10 most fre- 
quent people (heavily unbalanced label distribution, with 530 hits for George Bush and 53 hits for 
John Ashcroft); FIW(100) extracts up to 100 faces for each of the top 100 most frequent people 
(again, heavily unbalanced label distribution). 





2. We kept the features simple by design; more sophisticated part-based registration and representation would further 
improve results, as we will see in Section 8.


7.2.3 SPEAKER IDENTIFICATION FROM AUDIO


We also investigate a speaker identification task based on audio in an uncontrolled environment. 
The audio is extracted from an episode of Lost (season 1, episode 5) and is initially completely 
unaligned. Compared to recorded conversation in a controlled environment, this task is more re- 
alistic and very challenging due to a number of factors: background noise, strong variability in 
tone of voice due to emotions, and people shouting or talking at the same time. We use the Hidden
Markov Model Toolkit (HTK) (http://htk.eng.cam.ac.uk/) to compute forced alignment
(Moreno et al., 1998; Sjölander, 2003) between the closed captions and the audio (given the rough
initial estimates from closed caption time stamps, which are often overlapping and contain back- 
ground noise). After alignment, our data set is composed of 522 utterances (each one corresponding 
to a closed caption line, with aligned audio and speaker id obtained from aligned screenplay), with 
19 different speakers. For each speech segment (typically between 1 and 4 seconds) we extract 
standard voice processing audio features: pitch (Talkin, 1995), Mel-Frequency Cepstral Coefficients 
(MFCC) (Mermelstein, 1976), Linear predictive coding (LPC) (Proakis and Manolakis, 1996). This 
results in a total of 4,000 features, which we normalize to the range [—1, 1] and then project onto 50 
dimensions using PCA. 


7.3 Experimental Setup 


For the inductive experiments, we randomly split the instances in half into (1) an ambiguously
labeled training set, and (2) an unlabeled testing set. The ambiguous labels in the training set are
generated randomly according to different noise models which we specify in each case. For each
method and parameter setting, we report the average test error rate over 20 trials after training
the model on the ambiguous training set. We also report the corresponding standard deviation as an
error bar in the plots. Note that in the inductive setting we consider the test set as unlabeled, thus the
classifier votes among all possible labels:

    a^* = h(x) = \arg\max_{a \in \{1..L\}} g_a(x).

For the transductive experiments, there is no test set; we report the error rate for disambiguating 
the ambiguous labels (also averaged over 20 trials corresponding to random settings of ambiguous 
labels). The main differences with the inductive setting are: (1) the model is trained on all instances 
and tested on the same instances; and (2) the classifier votes only among the ambiguous labels, 
which is easier: 


    a^* = h(x) = \arg\max_{a \in y} g_a(x).
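The two prediction rules differ only in the set of labels over which the arg max is taken. A minimal sketch, assuming scores are stored in a dictionary from label to g_a(x) (the function names are ours):

```python
def predict_inductive(scores):
    """Vote among all L labels (test points carry no candidate set)."""
    return max(scores, key=scores.get)


def predict_transductive(scores, bag):
    """Vote only among the labels in the example's candidate set."""
    return max(bag, key=lambda a: scores[a])


scores = {1: 0.2, 2: 0.9, 3: 0.5}
print(predict_inductive(scores), predict_transductive(scores, {1, 3}))
```

Here label 2 has the highest score overall, but it is excluded in the transductive case because it is not in the candidate set.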


We compare our CLPL approach (denoted as mean in figures, due to the form of the loss) 
against the baselines presented in Section 7.1: Chance, Model 1, Discriminative EM model, k- 
Nearest Neighbor, weighted k-Nearest Neighbor, Naive model, supervised model, and supervised 
KNN. Note, in our experiments the Discriminative EM model was much slower to converge than all 
the other methods, and we only report the first series of experiments with this baseline. 

Table 3 summarizes the different settings used in each experiment. We experiment with different
noise models for ambiguous bags, parametrized by p, q, ε, see Figure 6. p represents the
proportion of examples that are ambiguously labeled. q represents the number of extra labels for
each ambiguous example. ε represents the degree of ambiguity (see Definition 1) for each ambiguous
example. We also vary the dimensionality by increasing the number of PCA components from 1 to
200, with half of extra labels added uniformly at random. In Figure 7, we vary the ambiguity size q
for three different subsets of Faces in the Wild. We report results on additional data sets in Figure 8.





Experiment | induct. | data set | parameter
# of ambiguous bags | yes | FIW(10b) | p ∈ [0, 0.95], q = 2
degree of ambiguity | yes | FIW(10b) | p = 1, q = 1, ε ∈ [1/(L−1), 1]
degree of ambiguity | no | FIW(10b) | p = 1, q = 1, ε ∈ [1/(L−1), 1]
dimension | yes | FIW(10b) | p = 1, q = (L−1)/2, d ∈ [1, .., 200]
ambiguity size | yes | FIW(10b) | p = 1, q ∈ [0, 0.9(L−1)]
ambiguity size | yes | FIW(10) | p = 1, q ∈ [0, 0.9(L−1)]
ambiguity size | yes | FIW(100) | p = 1, q ∈ [0, 0.9(L−1)]
ambiguity size | yes | Lost audio | p = 1, q ∈ [0, 0.9(L−1)]
ambiguity size | yes | ecoli | p = 1, q ∈ [0, 0.9(L−1)]
ambiguity size | yes | derma | p = 1, q ∈ [0, 0.9(L−1)]
ambiguity size | yes | abalone | p = 1, q ∈ [0, 0.9(L−1)]























Table 3: Summary of controlled experiments. We experiment with 3 different noise models for
         ambiguous bags, parametrized by p, q, ε. p represents the proportion of examples that are
         ambiguously labeled. q represents the number of extra labels for each ambiguous example
         (generated uniformly without replacement). ε represents the degree of ambiguity for each
         ambiguous example (see Definition 1). L is the total number of labels. We also study the
         effects of data set choice, inductive vs transductive learning, and feature dimensionality.
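The (p, q) noise model can be simulated as follows. This sketch covers only the uniform case (q extra labels drawn without replacement for a proportion p of the examples); labels are assumed to be 0-based integers, and the function name and the `seed` argument are hypothetical.

```python
import random


def make_bags(labels, L, p, q, seed=0):
    """Turn true labels into candidate sets: with probability p, add q distinct wrong labels."""
    rng = random.Random(seed)
    bags = []
    for y in labels:
        bag = {y}
        if rng.random() < p:
            others = [a for a in range(L) if a != y]
            bag.update(rng.sample(others, q))   # requires q <= L - 1
        bags.append(bag)
    return bags


labels = [0, 3, 7]
print(make_bags(labels, L=10, p=1.0, q=2))
print(make_bags(labels, L=10, p=0.0, q=2))
```

Each candidate set always contains the true label, so the generated data satisfy the partial-label assumption by construction.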


7.3.1 EXPERIMENTS WITH A BOOSTING VERSION OF CLPL 


We also experiment with a boosting version of our CLPL optimization, as presented in Appendix A. 
Results are shown in Figure 9, comparing our method with KNN and the naive method (also using 
boosting). Despite the change in learning algorithm and loss function, the trends remain the same. 


7.4 Comparative Summary 


We can draw several conclusions. Our proposed CLPL model outperformed all baselines
in all but one experiment (UCI dermatology data set), where it ranked second closely behind
Model 1. In particular, CLPL always outperformed the naive model, which ranks second.
As expected, the error rate increases monotonically with ambiguity size. We also
see that increasing ε significantly affects error, even though the ambiguity size is constant,
consistent with our bounds in Section 3.3. We also note that the supervised models defined in Section
7.1.6 (which ignore the ambiguously labeled examples) consistently perform worse than their
counterparts adapted for the ambiguous setting. For example, in Figure 6 (Top Left), a model trained
with nearly all examples ambiguously labeled (“mean” curve, p = 95%) performs as well as a
model which uses 60% of fully labeled examples (“supervised” curve, p = 40%). The same holds
between the “KNN” curve at p = 95% and the “supervised KNN” curve at p = 40%.





3. We first choose at random for each label a dominant co-occurring label which is sampled with probability ε; the rest
   of the labels are sampled uniformly with probability (1 − ε)/(L − 2) (there is a single extra label per example).


Figure 6: Results on Faces in the Wild in different settings, comparing our proposed CLPL (denoted
          as mean) to several baselines. In each case, we report the average error rate (y-axis) and
          standard deviation over 20 trials as in Figure 7. (top left) increasing proportion of
          ambiguous bags p, inductive setting. (top right) increasing ambiguity degree ε (Equation 1),
          inductive setting. (bottom left) increasing ambiguity degree ε (Equation 1), transductive
          setting. (bottom right) increasing dimensionality, inductive setting.


















































Figure 7: Additional results on Faces in the Wild, obtained by varying the ambiguity size q on the
          x-axis (inductive case). Left: balanced data set using 50 faces for each of the top 10
          labels. Middle: unbalanced data set using all faces for each of the top 10 labels. Right:
          unbalanced data set using up to 100 faces for each of the top 100 labels.


7.4.1 COMPARISON WITH VARIANTS OF OUR APPROACH 


In order to get some intuition on CLPL (Equation 2), which we refer to as the mean model in our
experiments, we also compare with the following sum and contrastive alternatives:


    L_\psi^{sum}(g(x), y) = \psi\Big(\sum_{a \in y} g_a(x)\Big) + \sum_{a \notin y} \psi(-g_a(x)),    (6)


    L_\psi^{contrastive}(g(x), y) = \sum_{a' \notin y} \psi\Big(\frac{1}{|y|} \sum_{a \in y} g_a(x) - g_{a'}(x)\Big).    (7)
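The three variants differ only in how the candidate scores enter ψ. The following sketch evaluates them side by side on a toy score vector, assuming the square hinge for ψ; function names are illustrative.

```python
def sq_hinge(u):
    return max(0.0, 1.0 - u) ** 2


def mean_loss(g, y, psi=sq_hinge):
    """CLPL (Equation 2): psi of the mean candidate score plus outside penalties."""
    mean_in = sum(g[a] for a in y) / len(y)
    return psi(mean_in) + sum(psi(-g[a]) for a in g if a not in y)


def sum_loss(g, y, psi=sq_hinge):
    """Equation 6: psi of the summed candidate scores plus outside penalties."""
    return psi(sum(g[a] for a in y)) + sum(psi(-g[a]) for a in g if a not in y)


def contrastive_loss(g, y, psi=sq_hinge):
    """Equation 7: psi of (mean candidate score minus each outside score)."""
    mean_in = sum(g[a] for a in y) / len(y)
    return sum(psi(mean_in - g[a]) for a in g if a not in y)


g = {1: 1.0, 2: 0.0, 3: -0.5}
y = {1, 2}
print(mean_loss(g, y), sum_loss(g, y), contrastive_loss(g, y))
```

The contrastive variant depends only on score differences, so adding a constant to all scores leaves it unchanged, unlike the mean and sum variants.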


Figure 8: Inductive results on different data sets. In each case, we report the average error rate (y-
          axis) and standard deviation over 20 trials as in Figure 7. Top Left: speaker identification
          from Lost audio. Top Right: ecoli data set (UCI). Bottom Left: dermatology data set
          (UCI). Bottom Right: abalone data set (UCI).


[Figure 9 plots. Panel titles: boosting version; boosting version; variations of our loss. Legend
entries include mean (ours), sum, contrastive, naive, k-NN baseline and 2 to 5 labels/example;
axis label: nb neighbors (k).]


Figure 9: Left: We experiment with a boosting version of the ambiguous learning, and compare
          to a boosting version of the naive baseline (here with ambiguous bags of size 3). We
          plot accuracy vs number of boosting rounds. The green horizontal line corresponds to
          the best performance (across k) of the k-NN baseline. Middle: accuracy of k-NN baseline
          across k. Right: we compare CLPL (labeled mean) with two variants defined in
          Equations 6 and 7, along with the naive model (same setting as Figure 6, Top Left).


When ψ(·) is the hinge loss, the mean and sum models are very similar, but this is not the case for
strictly convex binary losses. Figure 9 shows that variations on our cost function have little effect
in the transductive setting. In the inductive setting, other experiments we performed show that the
mean and sum versions are still very similar, but the contrastive version is worse. In general it seems
that models based on minimization of a convex loss function (naive and different versions of our
model) usually outperform the other models.


Figure 10: Predictions on Lost and C.S.I. Incorrect examples are: row 1, column 3 (truth: Boone);
           row 2, column 2 (truth: Jack).


8. Experiments with Partially Labeled Faces in Videos 


We now return to our introductory motivating example, naming people in TV shows (Figure 1, 
right). Our goal is to identify characters given ambiguous labels derived from the screenplay. Our 
data consists of 100 hours of Lost and C.S.I., from which we extract ambiguously labeled faces to 
learn models of common characters. We use the same features, learning algorithm and loss function 
as in Section 7.2.2. We also explore using additional person- and movie-specific constraints to 
improve performance. Sample results are shown in Figure 10. 


8.1 Data Collection 


We adopt the following filtering pipeline to extract face tracks, inspired by Everingham et al. (2006): 
(1) Run the off-the-shelf OpenCV face detector over all frames, searching over rotations and scales. 
(2) Run face part detectors⁴ over the face candidates. (3) Perform a 2D rigid transform of the parts
to a template. (4) Compute the score of a candidate face s(x) as the sum of part detector scores 
plus rigid fit error, normalizing each to weight them equally, and filtering out faces with low score. 
(5) Assign faces to tracks by associating face detections within a shot using normalized cross- 
correlation in RGB space, and using dynamic programming to group them together into tracks. 
(6) Subsample face tracks to avoid repetitive examples. In the experiments reported here we use the 
best scoring face in each track, according to s(x). 

Concretely, for a particular episode, step (1) finds approximately 100,000 faces, step (4) keeps 
approximately 10,000 of those, and after subsampling tracks in step (6) we are left with 1000 face 
examples. 


8.2 Ambiguous Label Selection 


Screenplays for popular TV series and movies are readily available for free on the web. Given an 
alignment of the screenplay to frames, we have ambiguous labels for characters in each scene: the 
set of speakers mentioned at some point in the scene, as shown in Figure 1. Alignment of screenplay 
to video uses methods presented in Cour et al. (2008) and Everingham et al. (2006), linking closed 
captions to screenplay. 





4. The detectors use boosted cascade classifiers of Haar features for the eyes, nose and mouth.


Lost (#labels, #episodes) | (8,16) | (16,8)† | (16,16) | (32,16)
Naive | 14% | 18.6% | 16.5% | 18.5%
ours (CLPL / “mean”) | 10% | 12.6% | 14% | 17%
ours+constraints | 6% | n/a | 11% | 13%























Table 4: Misclassification rates of different methods on TV show Lost. In comparison, for (16,16)
         the baseline performances are knn: 30%; Model 1: 44%; chance: 53%. †: This column
         contains results exactly reproducible from our publicly available reference
         implementation, which can be found at http://vision.grasp.upenn.edu/video. For simplicity,
         this public code does not include a version with extra constraints.


We use the ambiguous sets to select face tracks filtered through our pipeline. We prune scenes 
which contain characters other than the set we choose to focus on for experiments (top {8,16,32} 
characters), or contain 4 or more characters. This leaves ambiguous bags of size 1, 2 or 3, with an 
average bag size of 2.13 for Lost, and 2.17 for C.S.I.. 


8.3 Errors in Ambiguous Label Sets 


In the TV episodes we considered, we observed that approximately 1% of ambiguous label sets
were wrong, in that they did not contain the ground truth label of the face track. This had
several causes: presence of a non-English speaking character (Jin Kwon in Lost, who speaks
Korean) whose dialogue is not transcribed in the closed captions; sudden occurrence of an unknown,
uncredited character on screen; and finally alignment problems due to large discrepancies between
screenplay and closed captions. While this is not a major problem, it becomes so when we
consider additional cues (mouth motion, gender) that restrict the ambiguous label set. We will see how
we tackle this issue with a robust confidence measure for obtaining good precision recall curves in
Section 8.5.


8.4 Results with the Basic System 


Now that we have a set of instances (face tracks), feature descriptors for the face track and am- 
biguous label sets for each face track, we can apply the same method as described in the previous 
section. We use a transductive setting: we test our method on our ambiguously labeled training set. 

The confusion matrix displaying the distribution of ambiguous labels for the top 16 characters 
in Lost is shown in Figure 11 (left). The confusion matrix of our predictions after applying our 
ambiguous learning algorithm is shown in Figure 11 (right). Our method had the most trouble dis- 
ambiguating Ethan Rom from Claire Littleton (Ethan Rom only appears in 0.7% of the ambiguous 
bags, 3 times less than the second least common character) and Liam Pace from Charlie Pace (they
are brothers and co-occur frequently, as can be seen in the top figure). The case of Sun Kwon and Jin 
Kwon is a bit special, as Jin does not speak English in the series and is almost never mentioned in 
the closed-captions, which creates alignment errors between screenplay and closed captions. These 
difficulties illustrate some of the interesting challenges in ambiguously labeled data sets. As we 
can see, the most difficult classes are the ones with which another class is strongly correlated in the 
ambiguous label confusion matrix. This is consistent with the theoretical bounds we obtained in 
Section 3.3, which establish a relation between the class specific error rate and class specific degree 
of ambiguity ε.


[Figure 11 image. Values printed next to each character row, in extraction order:]

Character | first value | second value
Jack | 0.52381 | 0.91667
John | 0.48039 | 0.94118
Charlie | 0.45521 | 0.8625
Kate | 0.45317 | 0.84298
James | 0.45098 | 1
Boone | 0.44231 | 0.88462
Hurley | (illegible) | 0.8125
Sayid | 0.44231 | 0.94872
Michael | 0.46668 | 1
Claire | 0.49454 | 0.90164
Sun | 0.5641 | 0.96154
Walt | 0.43939 | 0.72727
Liam | 0.46032 | (illegible)
Shannon | 0.45455 | 0.81818
Jin | (illegible) | (illegible)
Ethan | 0.3888 | 0





Figure 11: Left: Label distribution of top 16 characters in Lost (using the standard MATLAB color
           map). Element D_ij represents the proportion of times class i was seen with class j in
           the ambiguous bags, and D\mathbf{1} = \mathbf{1}. Right: Confusion matrix of predictions from Section
           8.4. Element A_ij represents the proportion of times class i was classified as class j, and
           A\mathbf{1} = \mathbf{1}. Class priors for the most frequent, the median frequency, and the least frequent
           characters in Lost are Jack Shephard, 14%; Hugo Reyes, 6%; Liam Pace, 1%.


Quantitative results are shown in Table 4. We measure error according to average 0-1 loss with 
respect to hand-labeled groundtruth labeled in 8 entire episodes of Lost. Our model outperforms all 
the baselines, and we will further improve results. We now compare several methods to obtain the 
best possible precision at a given recall, and propose a confidence measure to this end. 


8.5 Improved Confidence Measure for Precision-recall Evaluation 


We obtain a precision-recall curve using a refusal to predict scheme, as used by Everingham et al. 
(2006): we report the precision p for the r most confident predictions, varying r € [0,1]. We com- 
pare several confidence measures based on the classifier scores g(x) and propose a novel one that 
significantly improves precision-recall, see Figure 12 for results. 


1. the max and ratio confidence measures (as used in Everingham et al., 2006) are defined as:


    C_{max}(g(x)) = \max_a g_a(x),


    C_{ratio}(g(x)) = \max_a \frac{\exp(g_a(x))}{\sum_b \exp(g_b(x))}.


2. the relative score can be defined as the difference between the best and second best scores
   over all classifiers (g_a)_{a∈{1..L}} (where a* = arg max_{a∈{1..L}} g_a(x)):


    C_{rel}(g(x)) = g_{a^*}(x) - \max_{a \in \{1..L\} - \{a^*\}} g_a(x).


3. we can define the relative-constrained score as an adaptation to the ambiguous setting; we
   only consider votes among ambiguous labels y (where a* = arg max_{a∈y} g_a(x)):


    C_{rel,y}(g(x)) = g_{a^*}(x) - \max_{a \in y - \{a^*\}} g_a(x).
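The four measures listed above can be sketched as follows, with scores stored in a dictionary from label to g_a(x). Returning infinity for a candidate set of size one is our convention (the maximum over an empty set is not defined in the text), and the function names are hypothetical.

```python
import math


def c_max(g):
    """Highest score."""
    return max(g.values())


def c_ratio(g):
    """Largest softmax probability (computed with the usual max-shift for stability)."""
    m = max(g.values())
    exps = [math.exp(v - m) for v in g.values()]
    return max(exps) / sum(exps)


def c_rel(g):
    """Margin between the best and second best scores over all labels."""
    ranked = sorted(g.values(), reverse=True)
    return ranked[0] - ranked[1]


def c_rel_constrained(g, y):
    """Margin between the best and second best scores among candidate labels only."""
    ranked = sorted((g[a] for a in y), reverse=True)
    if len(ranked) < 2:
        return math.inf   # assumption: an unambiguous bag gets maximal confidence
    return ranked[0] - ranked[1]


g = {1: 2.0, 2: 1.5, 3: 0.0, 4: 1.9}
y = {1, 2}
print(c_max(g), c_ratio(g), c_rel(g), c_rel_constrained(g, y))
```

In this example the strongest competitor (label 4) lies outside the candidate set, so the constrained margin is larger than the unconstrained one.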

Figure 12: Improved hybrid confidence measure for precision-recall evaluation. x axis: recall;
y axis: naming error rate for CLPL on 16 episodes of Lost (top 16 characters). The max
confidence score performs rather poorly, as it ignores other labels. The relative score improves
the high-precision/low-recall region by considering the margin instead. The relative-constrained
score improves the high-recall/low-precision region by only voting among the ambiguous bags,
but it suffers in the high-precision/low-recall region because some ambiguous bags may be
erroneous. Our hybrid confidence score gets the best of both worlds.
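
The refusal-to-predict protocol can be illustrated with a few lines of Python. This is a minimal sketch and not the evaluation code behind Figure 12: the function name `precision_at_recall` and the toy inputs are assumptions, and "recall" is taken to mean the fraction r of instances on which the classifier does not refuse to predict.

```python
def precision_at_recall(confidences, correct, r):
    """Precision over the fraction r of most confident predictions."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must lie in (0, 1]")
    # Rank instances from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    # Keep the top fraction r (at least one prediction), refuse on the rest.
    k = max(1, round(r * len(order)))
    kept = order[:k]
    return sum(1 for i in kept if correct[i]) / k


# Toy data (hypothetical): confidence of each prediction and whether it is right.
confidences = [0.9, 0.2, 0.7, 0.4]
correct = [True, False, True, False]
curve = [(r, precision_at_recall(confidences, correct, r))
         for r in (0.25, 0.5, 0.75, 1.0)]
```

Any of the confidence measures of this section can be plugged in as `confidences`; only the induced ranking matters.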


All of these choices have shortcomings, especially when the ambiguous label set contains
errors ($a \notin \mathbf{y}$ for the true label $a$). This can occur, for example, if we restrict
the label sets with heuristics that prune down the amount of ambiguity, such as the ones we consider
in Section 8.6 (mouth motion cue, gender, etc.). At low recall, we want maximum precision, so
we cannot rely too heavily on the heuristics behind the relative-constrained confidence. At high
recall, the errors of the classifier dominate the errors in the ambiguous labels, and the
relative-constrained confidence gives better precision because of the restriction. We introduce a
hybrid confidence measure that performs well at all recall levels $r$, interpolating between the
two confidence measures:


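$$h_a^r(x) = \begin{cases} g_a(x) & \text{if } a \in \mathbf{y}, \\ (1-r)\, g_a(x) + r \min_b g_b(x) & \text{else,} \end{cases} \qquad C_r(g(x)) = C_{\mathrm{rel}}(h^r(x)).$$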


By design, in the limit $r \to 0$, $C_r(g(x)) \approx C_{\mathrm{rel}}(g(x))$. In the limit $r \to 1$,
$h_a^r(x)$ is small for $a \notin \mathbf{y}$ and so $C_r(g(x)) \approx C_{\mathrm{rel},\mathbf{y}}(g(x))$.
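
The confidence measures of this section are simple to compute from a score vector. The sketch below is illustrative only: the function names are assumptions, labels are indexed from 0, and it assumes at least two labels overall and at least two labels in the ambiguous set so that a second-best score exists.

```python
import math


def c_max(g):
    """Max confidence: the best score."""
    return max(g)


def c_ratio(g):
    """Ratio confidence: largest softmax probability (shifted for stability)."""
    top = max(g)
    exps = [math.exp(s - top) for s in g]
    return max(exps) / sum(exps)


def c_rel(g):
    """Relative confidence: best score minus second-best score."""
    best, second = sorted(g, reverse=True)[:2]
    return best - second


def c_rel_constrained(g, y):
    """Relative confidence restricted to the ambiguous label set y."""
    return c_rel([g[a] for a in y])


def c_hybrid(g, y, r):
    """Hybrid confidence: scores outside y are pulled toward min(g) as r grows."""
    floor = min(g)
    h = [s if a in y else (1 - r) * s + r * floor for a, s in enumerate(g)]
    return c_rel(h)


# Toy scores (hypothetical): label 2 scores highest but is outside the set y.
g = [2.0, 0.5, 3.0, -1.0]
y = {0, 1}
```

On this toy example the hybrid measure coincides with the relative score at r = 0 and with the relative-constrained score at r = 1, which is the interpolation behavior described above.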


8.6 Additional Cues 


We investigate additional features to further improve the performance of our system: mouth motion, 
grouping constraints, gender. Final misclassification results are reported in Table 4. 

8.6.1 MOUTH MOTION


We use an approach similar to that of Everingham et al. (2006) to detect mouth motion during dialog
and adapt it to our ambiguous label setting (see footnote 5). For a face track $x$ with ambiguous
label set $\mathbf{y}$ and a temporally overlapping utterance from a speaker $a \in \{1..L\}$ (after
aligning screenplay and closed captions), we restrict $\mathbf{y}$ as follows:


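$$\mathbf{y} := \begin{cases} \{a\} & \text{if mouth motion,} \\ \mathbf{y} & \text{if refuse to predict or } \mathbf{y} = \{a\}, \\ \mathbf{y} - \{a\} & \text{if absence of mouth motion.} \end{cases}$$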
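
The restriction rule can be sketched as follows. This is an illustration under stated assumptions, not the detector used in the experiments: the function name, the threshold values `low` and `high`, and the convention that a low normalized cross-correlation between consecutive mouth regions indicates motion are all hypothetical.

```python
def restrict_by_mouth_motion(y, speaker, ncc, low=0.6, high=0.9):
    """Restrict the ambiguous label set y of a face track given the speaker.

    ncc is the normalized cross-correlation around the mouth region in
    consecutive frames; the thresholds are illustrative values only.
    """
    y = set(y)
    if y == {speaker}:
        return y              # nothing to restrict
    if ncc < low:
        return {speaker}      # mouth region changes: motion, track is the speaker
    if ncc > high:
        return y - {speaker}  # mouth region static: track is not the speaker
    return y                  # in between: refuse to predict, keep y unchanged
```

Keeping the set unchanged in the intermediate band is what makes the cue conservative: it only prunes labels when the detector is confident in either direction.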


8.6.2 GENDER CONSTRAINTS 


We introduce a gender classifier to constrain the ambiguous labels based on predicted gender. The
gender classifier is trained on a data set of registered male and female faces, by boosting a set of
decision stumps computed on Haar wavelets. We use the average score output by the gender classifier
over a face track. We assume that the gender of each name mentioned in the screenplay is known (using
a cast list automatically extracted from IMDB). We filter out the labels whose gender does not match
the predicted gender of a face track, provided the confidence exceeds a threshold (two thresholds,
one for females and one for males, are set on validation data to achieve 90% precision for each
direction of the gender prediction). Thus, we modify the ambiguous label set $\mathbf{y}$ as:


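$$\mathbf{y} := \begin{cases} \mathbf{y} & \text{if gender uncertain,} \\ \mathbf{y} - \{a : a \text{ is male}\} & \text{if gender predicts female,} \\ \mathbf{y} - \{a : a \text{ is female}\} & \text{if gender predicts male.} \end{cases}$$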
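
A minimal sketch of this filtering step is given below. The sign convention of the score (positive means male), the threshold values, and the `gender_of` lookup table are assumptions made for the example; in the system described above the two thresholds are tuned on validation data.

```python
def restrict_by_gender(y, gender_of, score, male_thr=1.0, female_thr=-1.0):
    """Drop labels whose known gender contradicts a confident gender prediction.

    score is the gender classifier output averaged over the face track
    (assumed convention: positive = male, negative = female).
    """
    y = set(y)
    if score >= male_thr:
        return {a for a in y if gender_of[a] != "female"}
    if score <= female_thr:
        return {a for a in y if gender_of[a] != "male"}
    return y  # gender uncertain: leave the label set untouched


# Hypothetical cast list for the example.
gender_of = {"Jack": "male", "Kate": "female", "Hugo": "male"}
```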


8.6.3 GROUPING CONSTRAINTS 


We propose a very simple must-not-link constraint, which states $y_i \neq y_j$ if face tracks
$x_i, x_j$ are in two consecutive shots (modeling the alternation of shots that is common in dialogs).
This constraint is active only when a scene has 2 characters. Unlike the previous constraints, this
constraint is incorporated as additional terms in our loss function, as in Yan et al. (2006). We also
propose groundtruth grouping constraints for comparison: $y_i = y_j$ for each pair of face tracks
$x_i, x_j$ that have the same label and are separated by at most one shot.
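
Enumerating the must-not-link pairs is straightforward once each face track is tagged with its shot and scene. The sketch below only builds the list of constrained pairs; how they enter the loss is not shown. The data layout (a list of `(shot_index, scene_id)` tuples and a per-scene character count) is an assumption made for the example.

```python
def must_not_link_pairs(tracks, n_characters):
    """Return index pairs (i, j) of face tracks that should get different labels.

    tracks: list of (shot_index, scene_id), one entry per face track.
    n_characters: dict mapping scene_id to the number of characters in the scene.
    """
    pairs = []
    for i, (shot_i, scene_i) in enumerate(tracks):
        for j in range(i + 1, len(tracks)):
            shot_j, scene_j = tracks[j]
            same_scene = scene_i == scene_j
            consecutive = abs(shot_i - shot_j) == 1
            # The constraint is active only in two-character scenes.
            if same_scene and consecutive and n_characters[scene_i] == 2:
                pairs.append((i, j))
    return pairs


# Hypothetical example: scene "A" is a two-person dialog, scene "B" is not.
tracks = [(0, "A"), (1, "A"), (2, "A"), (3, "B"), (4, "B")]
n_characters = {"A": 2, "B": 3}
```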


8.7 Ablative Analysis 


Figure 13 is an ablative analysis, showing error rate vs recall curves for different sets of cues. We see 
that the constraints provided by mouth motion help most, followed by gender and link constraints. 
The best setting (without using groundtruth) combines the former two cues. Also, we notice, once 
again, a significant performance improvement of our method over the naive method. 


8.8 Qualitative Results and Video Demonstration 


We show examples with predicted labels and corresponding accuracy for various characters in
C.S.I.; see Figure 14. Those results were obtained with the basic system of Section 8.4. Full-frame
detections for the Lost and C.S.I. data sets can be seen in Figure 10. We also propagate the predicted
labels of our model to all faces in the same face track throughout an episode. Video results for several
episodes can be found at http://www.youtube.com/user/AmbiguousNaming.


5. Motion or absence of motion is detected with a low and a high threshold on normalized
cross-correlation around mouth regions in consecutive frames.


Figure 13: Ablative analysis. x-axis: recall; y-axis: error rate for character naming across 16
episodes of Lost, and the 8, 16, and 32 most common labels (respectively for the left,
middle, right plots). We compare our method, mean, to the Naive model and show the
effect of adding several cues to our system. Link: simple must-not-link constraints from
shot alternation; Gender: gender cue for simplification of ambiguous bags; Mouth:
mouth motion cue for detecting the speaker with synchronous mouth motion; we also
consider the combination Mouth+Gender, as well as swapping in perfect components
such as Groundtruth link constraints and Groundtruth Mouth motion.


Figure 14: Left: Examples classified as Catherine Willows in the C.S.I. data set using our method
(zoom in for details). Results are sorted by classifier score, in column-major format;
this explains why most of the errors occur in the last columns. The precision is 85.3%.
Right: Examples classified as Sara Sidle in C.S.I. The precision is 78.3%.


9. Conclusion 


We have presented an effective learning approach for partially labeled data, where each instance is
tagged with more than one label. Theoretically, under reasonable assumptions on the data distribution,
one can show that our algorithm will produce an accurate classifier. We applied our method to
two partially-supervised naming tasks: one on still images and one on video from TV series. We also
compared to several strong competing algorithms on the same data sets and demonstrated that our
algorithm achieves superior performance. We attribute the success of the approach to better modeling
of the mutual exclusion between labels than the simple multi-label approach. Moreover, unlike
recently published techniques that address similar ambiguously labeled problems, our method does
not rely on heuristics and does not suffer from local optima of non-convex methods.


Appendix A. CLPL with Feature Selection Using Boosting


We derive Algorithm 1 by taking the second order Taylor expansion of the loss
$\mathcal{L}_\psi(g(x), \mathbf{y})$, with $\psi(u) = \exp(-u)$. The updates of the algorithm are
similar to a multiclass version of Gentleboost (Friedman et al., 2000), but keep a combined weight
$v_i$ for the positive example $f(x_i, \mathbf{y}_i)$ and weights $v_{i,a}$ for the negative examples
$f(x_i, a)$, $a \notin \mathbf{y}_i$.


Algorithm 1 Boosting for CLPL with exponential loss

Initialize weights: $v_i = 1 \;\forall i$, $v_{i,a} = 1 \;\forall i, a \notin \mathbf{y}_i$
for $t = 1 \ldots T$ do
    for $a = 1 \ldots L$ do
        Fit the parameters of each weak classifier $u(x)$ to minimize the second-order Taylor
        approximation of the cost function with respect to the $a$-th classifier:
            $\sum_i \big[ v_i \cdot \mathbf{1}(a \in \mathbf{y}_i)\,(u(x_i)/|\mathbf{y}_i| - 1)^2 + v_{i,a} \cdot \mathbf{1}(a \notin \mathbf{y}_i)\,(u(x_i) + 1)^2 \big] + \text{constant}.$
    end for
    Choose the combination of $u, a$ with lowest residual error.
    Update $g_a(x) = g_a(x) + u(x)$
    for $i = 1 \ldots m$ do
        if $a \in \mathbf{y}_i$ then
            $v_i = v_i \cdot \exp(-u(x_i))$
        else
            $v_{i,a} = v_{i,a} \cdot \exp(u(x_i))$
        end if
    end for
    Normalize $v$ to sum to 1.
end for
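
The weighted least-squares objective in the fitting step can be recovered by writing out the expansion explicitly. The derivation below is a sketch under the assumption that the weights hold the current per-example exponential losses, that is, $v_i = \exp(-g_{\mathbf{y}_i}(x_i))$ with $g_{\mathbf{y}_i}$ the average score over $\mathbf{y}_i$, and $v_{i,a} = \exp(g_a(x_i))$; this identification is not stated explicitly in the text.

```latex
% Effect of the update g_a <- g_a + u on example i, with psi(u) = exp(-u).
% Case a \in y_i: only the averaged positive term changes.
\exp\!\Big(-g_{\mathbf{y}_i}(x_i) - \tfrac{u(x_i)}{|\mathbf{y}_i|}\Big)
  \approx v_i \Big(1 - \tfrac{u(x_i)}{|\mathbf{y}_i|}
                     + \tfrac{1}{2}\tfrac{u(x_i)^2}{|\mathbf{y}_i|^2}\Big)
  = \tfrac{v_i}{2}\Big(\tfrac{u(x_i)}{|\mathbf{y}_i|} - 1\Big)^2 + \tfrac{v_i}{2},
\\
% Case a \notin y_i: only the negative term for label a changes.
\exp\!\big(g_a(x_i) + u(x_i)\big)
  \approx v_{i,a}\Big(1 + u(x_i) + \tfrac{1}{2}u(x_i)^2\Big)
  = \tfrac{v_{i,a}}{2}\big(u(x_i) + 1\big)^2 + \tfrac{v_{i,a}}{2}.
```

Summing over $i$ and dropping the common factor $1/2$ together with the terms that do not depend on $u$ gives the objective minimized by each weak classifier.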


Appendix B. Proofs


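Proof of Proposition 1 (Partial loss bound via ambiguity degree $\varepsilon$). The first inequality
comes from the fact that $h(x) \notin \mathbf{y} \Rightarrow h(x) \neq y$. For the second inequality,
fix an $x \in \mathcal{X}$ with $P(X = x) > 0$ and define $E_P[\cdot \mid x]$ as the expectation with
respect to $P(\mathbf{Y} \mid X = x)$.

$$\begin{aligned}
E_P[\mathcal{L}_A(h(x), \mathbf{Y}) \mid x] &= P(h(x) \notin \mathbf{Y} \mid X = x) = P(h(x) \neq Y,\; h(x) \notin Z \mid X = x) \\
&= \sum_{a \neq h(x)} P(Y = a \mid X = x)\,\big(1 - \underbrace{P(h(x) \in Z \mid X = x, Y = a)}_{\le\, \varepsilon \text{ by definition}}\big) \\
&\ge \sum_{a \neq h(x)} P(Y = a \mid X = x)\,(1 - \varepsilon) = (1 - \varepsilon)\, E_P[\mathcal{L}(h(x), Y) \mid x].
\end{aligned}$$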

















Hence, $E_P[\mathcal{L}(h(x), Y) \mid x] \le \frac{1}{1-\varepsilon} E_P[\mathcal{L}_A(h(x), \mathbf{Y}) \mid x]$
for any $x$. We conclude by taking expectation over $x$. The first inequality is tight: equality can be
achieved, for example, when $P(y \mid x)$ is deterministic and $h$ is a perfect classifier, that is,
$h(x) = y$ for all $x$. The second inequality is also tight: for example, consider the uniform case
with a fixed ambiguity size $|z| = C$ and, for all $x, y, z \neq y$,
$P(z \in Z \mid X = x, Y = y) = C/(L-1)$. In the proof above (second inequality), the only inequality
becomes an equality. In fact, this also shows that for any (rational) $\varepsilon$, we can find a
number of labels $L$, a distribution $P$ and a classifier $h$ such that there is equality. □
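
The tightness of the second inequality in the uniform case can be checked by exact enumeration. The sketch below is only an illustration: the function name and the choice L = 5, C = 2 are assumptions, labels are indexed from 0, and the classifier is taken to err on every instance so that the true 0/1 loss equals 1.

```python
from fractions import Fraction
from itertools import combinations


def partial_loss_when_wrong(L, C, y, pred):
    """Exact P(pred not in {y} ∪ Z) when Z is a uniformly random size-C
    subset of the L - 1 labels other than the true label y, and pred != y."""
    others = [a for a in range(L) if a != y]
    subsets = list(combinations(others, C))
    misses = sum(1 for z in subsets if pred not in z)
    return Fraction(misses, len(subsets))


L, C = 5, 2
eps = Fraction(C, L - 1)           # ambiguity degree of the uniform case
true_loss = Fraction(1)            # the classifier always predicts a wrong label
partial = partial_loss_when_wrong(L, C, y=0, pred=1)
```

Here `partial` equals `1 - eps`, so `true_loss == partial / (1 - eps)` and the bound of Proposition 1 holds with equality.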











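Proof of Proposition 3 (Partial loss bound via $(\varepsilon, \delta)$). We split up the expectation in two parts:

$$\begin{aligned}
E_P[\mathcal{L}(h(X), Y)] &= E_P[\mathcal{L}(h(X), Y) \mid (X,Y) \in G]\,(1 - \delta) + E_P[\mathcal{L}(h(X), Y) \mid (X,Y) \notin G]\,\delta \\
&\le E_P[\mathcal{L}(h(X), Y) \mid (X,Y) \in G]\,(1 - \delta) + \delta \\
&\le \frac{1}{1-\varepsilon}\, E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid (X,Y) \in G]\,(1 - \delta) + \delta.
\end{aligned}$$

We applied Proposition 1 in the last step. Using a symmetric argument,

$$\begin{aligned}
E_P[\mathcal{L}_A(h(X), \mathbf{Y})] &= E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid (X,Y) \in G]\,(1 - \delta) + E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid (X,Y) \notin G]\,\delta \\
&\ge E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid (X,Y) \in G]\,(1 - \delta).
\end{aligned}$$

Finally we obtain $E_P[\mathcal{L}(h(X), Y)] \le \frac{1}{1-\varepsilon} E_P[\mathcal{L}_A(h(X), \mathbf{Y})] + \delta$. □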
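
Proof of Proposition 4 (Label-specific partial loss bound). Fix $x \in \mathcal{X}$ such that
$P(X = x) > 0$ and $P(Y = a \mid x) > 0$ and define $E_P[\cdot \mid x, a]$ as the expectation w.r.t.
$P(Z \mid X = x, Y = a)$. We consider two cases:

a) if $h(x) = a$, $E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid x, a] = P(h(x) \neq a,\; h(x) \notin Z \mid X = x, Y = a) = 0$.

b) if $h(x) \neq a$, $E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid x, a] = P(h(x) \notin Z \mid X = x, Y = a) = 1 - P(h(x) \in Z \mid X = x, Y = a) \ge 1 - \varepsilon_a$.

We conclude by taking expectation over $x$:

$$\begin{aligned}
E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid Y = a] &= P(h(X) = a \mid Y = a)\, E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid h(X) = a, Y = a] \\
&\quad + P(h(X) \neq a \mid Y = a)\, E_P[\mathcal{L}_A(h(X), \mathbf{Y}) \mid h(X) \neq a, Y = a] \\
&\ge 0 + P(h(X) \neq a \mid Y = a) \cdot (1 - \varepsilon_a) \\
&= (1 - \varepsilon_a) \cdot E_P[\mathcal{L}(h(X), Y) \mid Y = a]. \qquad \square
\end{aligned}$$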

Proof of Proposition 5 (Partial label consistency). We assume $g(x)$ is found by minimizing
over an appropriately rich sequence of function classes (Tewari and Bartlett, 2005), in our case,
as $m \to \infty$, $\mathcal{G} \to \mathbb{R}^L$. Hence we can focus on analysis for a fixed $x$ (with
$P(X = x) > 0$), writing $g_a = g_a(x)$, and for any set $c \subseteq \{1, \ldots, L\}$,
$g_c = \sum_{a \in c} g_a / |c|$ and $P_c = P(\mathbf{Y} = c \mid X = x)$. We also write
$P_a = P(a \in \mathbf{Y} \mid X = x)$ for any label $a$, and use the shorthand
$P_{c,a} = P_{c \cup \{a\}}$ and $g_{c,a} = g_{c \cup \{a\}}$. We have:

$$\mathcal{L}_\psi(g) = \sum_c P_c \Big( \psi(g_c) + \sum_{a \notin c} \psi(-g_a) \Big).$$


Note that the derivative $\psi'(\cdot)$ exists and is non-positive and non-decreasing by assumption and
$\psi'(z) < 0$ for $z \le 0$. The assumptions imply that $\psi(-\infty) \to \infty$, so assuming that
$P_a < 1$, minimizers are upper-bounded: $g_a < \infty$. The case of $P_a = 0$ leads to
$g_a \to -\infty$ and it can be ignored without loss of generality, so we can assume that the optimal
$g$ is bounded for fixed $P$ with $0 < P_a < 1$.

Taking the derivative of the loss with respect to $g_a$ and setting it to 0, we have the first order
optimality conditions:

$$\frac{\partial \mathcal{L}_\psi(g)}{\partial g_a} = \sum_{c :\, a \notin c} \frac{P_{c,a}\, \psi'(g_{c,a})}{|c| + 1} - (1 - P_a)\, \psi'(-g_a) = 0.$$








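Now suppose (for contradiction) that at a minimizer $g$, $b \in \arg\max_{a'} g_{a'}$ but $P_a > P_b$
for some $a \in \arg\max_{a'} P_{a'}$. Subtracting the optimality conditions for $a, b$ from each other,
we get

$$\sum_{c :\, a, b \notin c} \frac{P_{c,a}\, \psi'(g_{c,a}) - P_{c,b}\, \psi'(g_{c,b})}{|c| + 1} = (1 - P_a)\, \psi'(-g_a) - (1 - P_b)\, \psi'(-g_b).$$

Since $g_a \le g_b$, $\psi'(g_{c,a}) \le \psi'(g_{c,b})$ and $\psi'(-g_a) \ge \psi'(-g_b)$. Plugging in on both sides:

$$\sum_{c :\, a, b \notin c} \frac{(P_{c,a} - P_{c,b})\, \psi'(g_{c,b})}{|c| + 1} \ge (P_b - P_a)\, \psi'(-g_b).$$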


By the dominance assumption, $(P_{c,a} - P_{c,b}) \ge 0$, and since $(P_b - P_a) < 0$ and $\psi'(\cdot)$
is non-positive, the left-hand side is non-positive while the right-hand side is non-negative. The only
possibility of the inequality holding is that $\psi'(-g_b) = 0$ (which implies $-g_b > 0$) and
$(P_{c,a} - P_{c,b})\, \psi'(g_{c,b}) = 0$ for all $c$. But $(P_b - P_a) < 0$ implies that there exists a
subset $c$ such that $(P_{c,a} - P_{c,b}) > 0$. Since $b \in \arg\max g$, $g_{c,b} \le g_b$, so
$g_{c,b} < 0$, hence $\psi'(g_{c,b}) < 0$, a contradiction.


When $P(y \mid x)$ is deterministic, let $P(y \mid x) = \mathbf{1}(y = a)$. Clearly, if
$\varepsilon < 1$, then $a = \arg\max_{a'} P_{a'}$ and $P_a = 1 > P_{a'}, \forall a' \neq a$. Then the
minimizer $g$ satisfies either (1) $g_a \to \infty$ (this happens if $\psi'(\cdot) < 0$ for finite
arguments) while the $g_{a'}$ are finite because of the $(1 - P_{a'})\, \psi(-g_{a'})$ terms in the
objective, or (2) $g$ is finite and the proof above applies since dominance holds: $P_{c,b} = 0$ if
$a \notin c$, so we can apply the theorem. □
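
The statement can be illustrated numerically for a single fixed $x$. The sketch below minimizes the pointwise loss by plain gradient descent and compares the highest-scoring label with the label most likely to appear in the ambiguous set. Everything specific here is an assumption chosen for the example: the three-label distribution over ambiguous sets (which satisfies the dominance condition for label 0), the logistic choice $\psi(u) = \log(1 + e^{-u})$, the step size, and the iteration count. Labels are indexed from 0.

```python
import math

# Hypothetical distribution P(Y = c | x) over ambiguous sets, L = 3 labels.
P = {frozenset({0, 1}): 0.5, frozenset({0, 2}): 0.3, frozenset({1, 2}): 0.2}
L = 3


def dpsi(u):
    """Derivative of the logistic loss psi(u) = log(1 + exp(-u))."""
    return -1.0 / (1.0 + math.exp(u))


def gradient(g):
    """Gradient of sum_c P_c * (psi(g_c) + sum_{a not in c} psi(-g_a))."""
    grad = [0.0] * L
    for c, p in P.items():
        mean = sum(g[a] for a in c) / len(c)
        for a in range(L):
            if a in c:
                grad[a] += p * dpsi(mean) / len(c)
            else:
                grad[a] -= p * dpsi(-g[a])
    return grad


g = [0.0] * L
for _ in range(5000):
    g = [ga - 0.5 * da for ga, da in zip(g, gradient(g))]

# P_a = P(a in Y | x) for each label a.
P_label = [sum(p for c, p in P.items() if a in c) for a in range(L)]
best_score = max(range(L), key=lambda a: g[a])
best_prob = max(range(L), key=lambda a: P_label[a])
```

In this example label 0 is both the most probable member of the ambiguous set and the label with the highest score at the minimizer, as the proposition predicts.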
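
Proof of Proposition 6 (Comparison between partial losses). Let $a^* = \arg\max_{a \in 1..L} g_a(x)$. For
the first inequality, if $a^* \in \mathbf{y}$, $\mathcal{L}_\psi^{\max}(g(x), \mathbf{y}) \ge 0 = 2 \mathcal{L}_A(g(x), \mathbf{y})$. Otherwise $a^* \notin \mathbf{y}$:

$$\begin{aligned}
\mathcal{L}_\psi^{\max}(g(x), \mathbf{y}) &\ge \psi\big(\max_{a \in \mathbf{y}} g_a(x)\big) + \psi(-g_{a^*}(x)) \ge \psi(g_{a^*}(x)) + \psi(-g_{a^*}(x)) \\
&\ge 2\, \psi\Big(\frac{g_{a^*}(x) - g_{a^*}(x)}{2}\Big) = 2\, \psi(0) \ge 2\, \mathcal{L}_A(g(x), \mathbf{y}).
\end{aligned}$$

The second inequality comes from the fact that

$$\max_{a \in \mathbf{y}} g_a(x) \ge \frac{1}{|\mathbf{y}|} \sum_{a \in \mathbf{y}} g_a(x).$$

For the third inequality, we use the convexity of $\psi$:

$$\psi\Big(\frac{1}{|\mathbf{y}|} \sum_{a \in \mathbf{y}} g_a(x)\Big) \le \frac{1}{|\mathbf{y}|} \sum_{a \in \mathbf{y}} \psi(g_a(x)).$$

For the tightness proof: when $g_a(x)$ is constant over $a \in \mathbf{y}$, we have

$$\psi\big(\max_{a \in \mathbf{y}} g_a(x)\big) = \psi\Big(\frac{1}{|\mathbf{y}|} \sum_{a \in \mathbf{y}} g_a(x)\Big) = \frac{1}{|\mathbf{y}|} \sum_{a \in \mathbf{y}} \psi(g_a(x)),$$

implying $\mathcal{L}_\psi^{\max}(g(x), \mathbf{y}) = \mathcal{L}_\psi(g(x), \mathbf{y}) = \mathcal{L}_\psi^{\mathrm{naive}}(g(x), \mathbf{y})$.

As for the first inequality, we provide a sequence $g^{(n)}$ that verifies equality in the limit: let
$g_a^{(n)}(x) = -1/n$ if $a \in \mathbf{y}$, $g_b^{(n)}(x) = 0$ for some $b \notin \mathbf{y}$, and
$g_c^{(n)}(x) = -n$ for all $c \notin \mathbf{y}, c \neq b$. Then, provided $\psi(0) = 1$ and
$\lim_{u \to +\infty} \psi(u) = 0$, we have $\lim_{n \to +\infty} \mathcal{L}_\psi^{\max}(g^{(n)}(x), \mathbf{y}) = 2$
and, for all $n$, $\mathcal{L}_A(g^{(n)}(x), \mathbf{y}) = 1$. □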
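
The chain of inequalities can also be checked numerically on random score vectors. The sketch below is an illustration, not part of the proof: it uses $\psi(u) = \exp(-u)$, takes the three surrogate losses in the form implied by the proof above (a positive term computed with the max, the mean, or the averaged loss over the ambiguous set, plus the same negative terms), and indexes labels from 0. All function names are assumptions.

```python
import math
import random


def psi(u):
    return math.exp(-u)  # convex, decreasing, psi(0) = 1


def partial_01_loss(g, y):
    """1 if the top-scoring label falls outside the ambiguous set y, else 0."""
    return 0.0 if max(range(len(g)), key=lambda a: g[a]) in y else 1.0


def negatives(g, y):
    """Shared penalty on labels outside the ambiguous set."""
    return sum(psi(-g[a]) for a in range(len(g)) if a not in y)


def loss_max(g, y):
    return psi(max(g[a] for a in y)) + negatives(g, y)


def loss_clpl(g, y):
    return psi(sum(g[a] for a in y) / len(y)) + negatives(g, y)


def loss_naive(g, y):
    return sum(psi(g[a]) for a in y) / len(y) + negatives(g, y)


random.seed(0)
L = 5
chain_holds = True
for _ in range(1000):
    g = [random.uniform(-3.0, 3.0) for _ in range(L)]
    y = set(random.sample(range(L), random.randint(1, L - 1)))
    chain = [2 * partial_01_loss(g, y), loss_max(g, y),
             loss_clpl(g, y), loss_naive(g, y)]
    # Small tolerance guards against floating-point rounding when terms coincide.
    chain_holds = chain_holds and all(
        lo <= hi + 1e-12 for lo, hi in zip(chain, chain[1:]))
```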


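Proof of Proposition 7 (Generalization bounds). The proof uses Definition 11 for Rademacher
and Gaussian complexity, Lemma 12, Theorem 13 and Theorem 14 from Bartlett and Mendelson
(2002), reproduced below and adapted to our notations for completeness. We apply Theorem 13
with $\mathcal{L} := \frac{1}{L} \mathcal{L}_A$, $\phi := \frac{1}{L} \mathcal{L}_{\psi_\gamma}$:

$$\frac{1}{L}\, E_P[\mathcal{L}_A(g(X), \mathbf{Y})] \le \frac{1}{L}\, E_S[\mathcal{L}_{\psi_\gamma}(g(X), \mathbf{Y})] + R_m(\phi \circ \mathcal{G}) + \sqrt{\frac{8 \log(2/\eta)}{m}}.$$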
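
From Lemma 12, $R_m(\phi \circ \mathcal{G}) \le \frac{1}{c}\, G_m(\phi \circ \mathcal{G})$. From Theorem 14,
$G_m(\phi \circ \mathcal{G}) \le 2\lambda \sum_{a=1}^{L} G_m(\mathcal{G}_a)$. Let $(\nu_i)$ be $m$
independent standard normal random variables.

$$\begin{aligned}
\hat{G}_m(\mathcal{G}_a) &= E_\nu\Big[ \sup_{g_a \in \mathcal{G}_a} \Big| \frac{2}{m} \sum_i \nu_i\, g_a(x_i) \Big| \;\Big|\; S \Big]
 = E_\nu\Big[ \sup_{\|w_a\| \le B} \Big| \frac{2}{m}\, w_a \cdot \sum_i \nu_i\, f(x_i) \Big| \;\Big|\; S \Big] \\
&= \frac{2B}{m}\, E_\nu\Big[ \Big\| \sum_i \nu_i\, f(x_i) \Big\| \;\Big|\; S \Big]
 = \frac{2B}{m}\, E_\nu\Big[ \sqrt{\sum_{i,j} \nu_i \nu_j\, f(x_i)^\top f(x_j)} \;\Big|\; S \Big] \\
&\le \frac{2B}{m} \sqrt{\sum_{i,j} E_\nu\big[ \nu_i \nu_j\, f(x_i)^\top f(x_j) \mid S \big]}
 = \frac{2B}{m} \sqrt{\sum_i \|f(x_i)\|^2}.
\end{aligned}$$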


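Putting everything together, $R_m(\phi \circ \mathcal{G}) \le \frac{2\lambda}{c} \sum_{a=1}^{L} G_m(\mathcal{G}_a) \le \frac{4 \lambda L B}{m c} \sqrt{\sum_i \|f(x_i)\|^2}$ and:

$$E_P[\mathcal{L}_A(g(X), \mathbf{Y})] \le E_S[\mathcal{L}_{\psi_\gamma}(g(X), \mathbf{Y})] + \frac{4 \lambda L^2 B}{m c} \sqrt{\sum_i \|f(x_i)\|^2} + L \sqrt{\frac{8 \log(2/\eta)}{m}}.$$

The Lipschitz constant from Theorem 14 can be computed as $\lambda := \frac{p}{\gamma} \sqrt{L}$, using the
Lipschitz constant of the scalar function $\psi_\gamma$, which is $\frac{p}{\gamma}$, and the fact that
$\|g(x)\|_1 \le \sqrt{L}\, \|g(x)\|_2$. □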
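
The Gaussian-average step for the linear class can be sanity-checked by simulation. The sketch below estimates $\frac{2B}{m} E_\nu \| \sum_i \nu_i f(x_i) \|$ by Monte Carlo and compares it with the closed-form bound $\frac{2B}{m} \sqrt{\sum_i \|f(x_i)\|^2}$. It assumes the Euclidean norm for the constraint on the weight vector, so that the supremum over the weight ball is attained in closed form; the feature vectors, the number of draws, and the function names are hypothetical.

```python
import math
import random


def gaussian_average_linear(features, B, n_draws=5000, seed=0):
    """Monte Carlo estimate of (2B/m) * E || sum_i nu_i f(x_i) ||_2."""
    rng = random.Random(seed)
    m, d = len(features), len(features[0])
    total = 0.0
    for _ in range(n_draws):
        nu = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Noise-weighted sum of the feature vectors, coordinate by coordinate.
        s = [sum(nu[i] * features[i][k] for i in range(m)) for k in range(d)]
        total += math.sqrt(sum(v * v for v in s))
    return 2.0 * B / m * total / n_draws


def gaussian_average_bound(features, B):
    """Closed-form upper bound (2B/m) * sqrt(sum_i ||f(x_i)||^2)."""
    m = len(features)
    return 2.0 * B / m * math.sqrt(sum(v * v for f in features for v in f))


# Hypothetical two-dimensional features for m = 4 instances.
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]]
estimate = gaussian_average_linear(features, B=1.0)
bound = gaussian_average_bound(features, B=1.0)
```

The estimate stays below the bound because the only inequality in the chain is Jensen's inequality applied to the square root.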
































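Definition 11 (Definition 2 from Bartlett and Mendelson (2002)) Let $\mu$ be a probability
distribution on a set $\mathcal{X}$ and suppose that $S = \{x_i\}_{i=1}^m$ are independent samples
sampled from $\mu$. Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to \mathbb{R}$. Define the
random variables

$$\hat{R}_m(\mathcal{F}) = E_\sigma\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{2}{m} \sum_i \sigma_i f(x_i) \Big| \;\Big|\; S \Big], \qquad
\hat{G}_m(\mathcal{F}) = E_\nu\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{2}{m} \sum_i \nu_i f(x_i) \Big| \;\Big|\; S \Big],$$

where $(\sigma_i)$ are $m$ independent uniform $\{\pm 1\}$-valued random variables and $(\nu_i)$ are $m$
independent standard normal random variables. Then the Rademacher (resp. Gaussian) complexity of
$\mathcal{F}$ is $R_m(\mathcal{F}) = E_S[\hat{R}_m(\mathcal{F})]$ (resp. $G_m(\mathcal{F}) = E_S[\hat{G}_m(\mathcal{F})]$).

$R_m(\mathcal{F})$ and $G_m(\mathcal{F})$ quantify how much an $f \in \mathcal{F}$ can be correlated with a
noise sequence of length $m$.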


Lemma 12 (Lemma 4 from Bartlett and Mendelson (2002)) There are absolute constants $c$ and
$C$ such that for every class $\mathcal{G}$ and every integer $m$,

$$c\, R_m(\mathcal{G}) \le G_m(\mathcal{G}) \le C \log m \; R_m(\mathcal{G}).$$

Theorem 13 (Theorem 8 from Bartlett and Mendelson (2002)) Consider a loss function
$\mathcal{L} : \mathcal{A} \times \mathcal{Y} \to [0,1]$ and a dominating cost function
$\phi : \mathcal{A} \times \mathcal{Y} \to [0,1]$, where $\mathcal{A}$ is an arbitrary output space.
Let $\mathcal{G}$ be a class of functions mapping from $\mathcal{X}$ to $\mathcal{A}$ and let
$S = \{(x_i, y_i)\}_{i=1}^m$ be independently selected according to the probability measure $P$.
Define $\phi \circ \mathcal{G} = \{(x,y) \mapsto \phi(g(x), y) - \phi(0, y) : g \in \mathcal{G}\}$.
Then, for any integer $m$ and any $\eta \in (0,1)$, with probability at least $1 - \eta$ over samples
of length $m$, $\forall g \in \mathcal{G}$:

$$E_P[\mathcal{L}(g(X), Y)] \le E_S[\phi(g(X), Y)] + R_m(\phi \circ \mathcal{G}) + \sqrt{\frac{8 \log(2/\eta)}{m}}.$$

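Theorem 14 (Theorem 14 from Bartlett and Mendelson (2002)) Let $\mathcal{A} = \mathbb{R}^L$, and let
$\mathcal{G}$ be a class of functions mapping $\mathcal{X}$ to $\mathcal{A}$. Suppose that there are
real-valued classes $\mathcal{G}_1, \ldots, \mathcal{G}_L$ such that $\mathcal{G}$ is a subset of their
direct sum. Assume further that $\phi : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ is such that, for
all $y \in \mathcal{Y}$, $\phi(\cdot, y)$ is a Lipschitz function (with respect to Euclidean distance on
$\mathcal{A}$) with constant $\lambda$ which passes through the origin and is uniformly bounded. For
$g \in \mathcal{G}$, define $\phi \circ g$ as the mapping $(x,y) \mapsto \phi(g(x), y)$.
Then, for every integer $m$ and every sample $S = \{(x_i, y_i)\}_{i=1}^m$,

$$\hat{G}_m(\phi \circ \mathcal{G}) \le 2 \lambda \sum_{a=1}^{L} \hat{G}_m(\mathcal{G}_a),$$

where $\hat{G}_m(\phi \circ \mathcal{G})$ are the Gaussian averages of $\phi \circ \mathcal{G}$ with
respect to the sample $\{(x_i, y_i)\}_{i=1}^m$ and $\hat{G}_m(\mathcal{G}_a)$ are the Gaussian averages
of $\mathcal{G}_a$ with respect to the sample $\{x_i\}_{i=1}^m$.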


Proof of Proposition 8 (Generalization bounds on true loss). This follows from Propositions 7
and 1. □


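Proof of Lemma 9. Let us write $z = z(x)$, $y = y(x)$, $\mathbf{y} = \mathbf{y}(x)$.

• Let $a \in z$. By hypothesis, $\exists x' \in B_\eta(x) : g_a(x') \le -\frac{\eta}{2}$. By definition of $B_\eta(x)$,

$$g_a(x) = g_a(x') + w_a \cdot (f(x) - f(x')) \le g_a(x') + \|w_a\|_* \, \eta \le g_a(x') + \eta \le \frac{\eta}{2}.$$

In fact, we also have $g_a(x) < \frac{\eta}{2}$, by considering two cases ($w_a = 0$ or $w_a \neq 0$) and
using the fact that $\|f(x) - f(x')\| < \eta$.

• Let $a \notin \mathbf{y}$. Since $\mathcal{L}_\psi(g(x), \mathbf{y}) \le \psi(\eta/2)$ and each term is nonnegative, we have:

$$\psi(-g_a(x)) \le \psi\Big(\frac{\eta}{2}\Big) \;\Rightarrow\; g_a(x) \le -\frac{\eta}{2}.$$

• Let $a = y$. $\mathcal{L}_\psi(g(x), \mathbf{y}) \le \psi(\eta/2)$ also implies the following:

$$\begin{aligned}
&\psi\Big(\frac{1}{|\mathbf{y}|} \sum_{b \in \mathbf{y}} g_b(x)\Big) \le \psi\Big(\frac{\eta}{2}\Big) \\
\Rightarrow\;& \frac{1}{|\mathbf{y}|} \sum_{b \in \mathbf{y}} g_b(x) \ge \frac{\eta}{2} \\
\Rightarrow\;& g_y(x) \ge |\mathbf{y}|\, \frac{\eta}{2} - \sum_{b \in z} g_b(x).
\end{aligned}$$

Since $g_b(x) < \frac{\eta}{2}$ for every $b \in z$ by the first case and $|\mathbf{y}| = |z| + 1$, the
right-hand side is at least $\frac{\eta}{2}$, with strict inequality when $z$ is nonempty.

Finally, $\forall a \neq y$, $g_a(x) < g_y(x)$ and $g$ classifies $x$ correctly. □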





Proof of Corollary 10. Let $a \in z(x)$. By the empty intersection hypothesis,
$\exists i \ge 1 : a \notin z(x_i)$, and since $y(x_i) = y(x)$ and $a \neq y(x)$ we also have
$a \notin \mathbf{y}(x_i)$. Since $\mathcal{L}_\psi(g(x_i), \mathbf{y}(x_i)) \le \psi(\eta/2)$, we have
$g_a(x_i) \le -\frac{\eta}{2}$, as in the previous proof. We can apply Lemma 9 (with $x' = x_i$). □